Importing pyspark in the Python shell

copycodes 2020. 8. 17. 09:13

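This is a copy of someone else's question from another forum that was never answered. I have the same problem, so I thought I would ask it again here. (See http://geekple.com/blogs/feeds/Xgzu7/posts/351703064084736)

Spark is installed properly on my machine, and when I use ./bin/pyspark as my Python interpreter I can run Python programs that use the pyspark modules without error.

However, when I start the regular Python shell and try to import the pyspark modules, I get an error: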

from pyspark import SparkContext

and it says

"No module named pyspark".

How can I fix this? Is there an environment variable I need to set to point Python at the pyspark headers/libraries/etc.? If my Spark installation is in /spark/, which pyspark paths do I need to include? Or can pyspark programs only be run from the pyspark interpreter?
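
Before changing any settings, it can help to confirm what the plain interpreter actually sees. The sketch below uses only the standard library; the helper name describe_module is made up for illustration. It reports whether a module is importable from the current sys.path and, if so, where it would be loaded from.

```python
import importlib.util
import sys


def describe_module(name):
    """Return where a top-level module would be loaded from, or None if it cannot be found."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Namespace packages have no single origin file.
    return spec.origin or "(no single file)"


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("pyspark found at:", describe_module("pyspark"))
```

If the second line prints None, the interpreter shown on the first line has no pyspark on its path, which is the situation described in the question.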

Here is a simple method (if you don't care about how it works!).

Use findspark

Install findspark from a terminal, then initialize it in your Python shell:

pip install findspark

import findspark
findspark.init()

Import the necessary modules

from pyspark import SparkContext
from pyspark import SparkConf

Done!

If an error like this is printed:

ImportError: No module named py4j.java_gateway

add $SPARK_HOME/python/build to PYTHONPATH:

export SPARK_HOME=/Users/pzhang/apps/spark-1.1.0-bin-hadoop2.4
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH

It turns out that the pyspark bin script launches Python and automatically loads the correct library paths. Check $SPARK_HOME/bin/pyspark:

# Add the PySpark classes to the Python path:
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH

I added this line to my .bashrc file and now the modules are found correctly!

Don't run your py file as: python filename.py. Instead use: spark-submit filename.py

It started working once I exported the Spark path and the Py4j path.

export SPARK_HOME=/usr/local/Cellar/apache-spark/1.5.1
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH 
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH

So, if you don't want to type these every time you start the Python shell, you can add them to your .bashrc file.

On Mac, I use Homebrew to install Spark (formula "apache-spark"). Then I set PYTHONPATH this way so that the Python import works:

export SPARK_HOME=/usr/local/Cellar/apache-spark/1.2.0
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH

Replace "1.2.0" with the actual apache-spark version on your Mac.

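To run Spark from pyspark, two components need to work together:

The pyspark Python package
A Spark instance in a JVM

When you launch things with spark-submit or pyspark, these scripts take care of both: they set up PYTHONPATH, PATH, etc. so that your script can find pyspark, and they also start the Spark instance, configured according to your parameters, e.g. --master X.

Alternatively, you can bypass these scripts and run your Spark application directly in the Python interpreter, as in python myscript.py. This is especially interesting when Spark scripts start to become more complex and eventually take their own arguments.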

1. Ensure the pyspark package can be found by the Python interpreter. As already discussed, either add the spark/python dir to PYTHONPATH or install pyspark directly with pip install.
2. Set the parameters of the Spark instance from your script (those that used to be passed to pyspark).
   - For Spark configurations that you would normally set with --conf, define them with a config object (or string configs) in SparkSession.builder.config.
   - For main options (like --master or --driver-memory), for the moment you can set them by writing to the PYSPARK_SUBMIT_ARGS environment variable. To make things cleaner and safer you can set it from within Python itself, and Spark will read it when starting.
3. Start the instance, which just requires you to call getOrCreate() from the builder object.

Your script can therefore have something like this:

import os

from pyspark.sql import SparkSession

# Placeholder: fill this from your own argument parsing, e.g. "--master local[4]"
spark_main_opts = ""

if __name__ == "__main__":
    if spark_main_opts:
        # Set main options, e.g. "--master local[4]"
        os.environ['PYSPARK_SUBMIT_ARGS'] = spark_main_opts + " pyspark-shell"

    # Set spark config
    spark = (SparkSession.builder
             .config("spark.checkpoint.compress", True)
             .config("spark.jars.packages", "graphframes:graphframes:0.5.0-spark2.1-s_2.11")
             .getOrCreate())
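
The environment-variable step can be pulled into a small helper, sketched here under a few assumptions: the function name set_submit_args and its environ parameter are hypothetical, and the trailing "pyspark-shell" token simply mirrors the snippet above. It has to be called before getOrCreate(), because the variable is only read when the Spark instance is launched.

```python
import os


def set_submit_args(main_opts, environ=os.environ):
    """Store spark-submit style options in PYSPARK_SUBMIT_ARGS.

    main_opts is a string such as "--master local[4]". An empty string
    leaves the environment untouched. Returns the resulting value, or None
    if the variable is not set.
    """
    if main_opts:
        environ["PYSPARK_SUBMIT_ARGS"] = main_opts.strip() + " pyspark-shell"
    return environ.get("PYSPARK_SUBMIT_ARGS")
```

Passing a plain dict as environ makes the helper easy to try out without touching the real process environment.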

To get rid of ImportError: No module named py4j.java_gateway, you need to add the following lines:

import os
import sys

# Raw strings keep the Windows backslashes from being read as escape sequences
os.environ['SPARK_HOME'] = r"D:\python\spark-1.4.1-bin-hadoop2.4"

sys.path.append(r"D:\python\spark-1.4.1-bin-hadoop2.4\python")
sys.path.append(r"D:\python\spark-1.4.1-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip")

try:
    from pyspark import SparkContext
    from pyspark import SparkConf

    print("success")

except ImportError as e:
    print("error importing spark modules", e)
    sys.exit(1)

On Windows 10 the following worked for me. I added the following environment variables using Settings > Edit environment variables for your account:

SPARK_HOME=C:\Programming\spark-2.0.1-bin-hadoop2.7
PYTHONPATH=%SPARK_HOME%\python;%PYTHONPATH%

(change "C:\Programming\..." to the folder in which you have installed spark)

For Linux users, the following is the correct (and non-hard-coded) way of including the pyspark library in PYTHONPATH. Both path parts are necessary:

The path to the pyspark Python module itself, and
The path to the zipped library that the pyspark module relies on when imported

Notice below that the zipped library version is dynamically determined, so we do not hard-code it.

export PYTHONPATH=${SPARK_HOME}/python/:$(echo ${SPARK_HOME}/python/lib/py4j-*-src.zip):${PYTHONPATH}
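
The same idea can be applied from inside Python when editing the shell profile is not an option. This is only a sketch: the function names are invented for illustration, and it assumes the python/lib/py4j-*-src.zip layout shown in the export line above.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the path entries PySpark needs under a given SPARK_HOME."""
    python_dir = os.path.join(spark_home, "python")
    # Match whatever py4j version ships with this Spark build.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def add_spark_to_sys_path(spark_home):
    """Prepend any missing PySpark entries to sys.path and return the ones added."""
    added = [p for p in spark_python_paths(spark_home) if p not in sys.path]
    sys.path[:0] = added
    return added
```

A typical call would be add_spark_to_sys_path(os.environ["SPARK_HOME"]) near the top of a script, before the first pyspark import.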

I am running a Spark cluster on a CentOS VM, installed from the Cloudera yum packages.

I had to set the following variables to run pyspark.

export SPARK_HOME=/usr/lib/spark;
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip:$PYTHONPATH

export PYSPARK_PYTHON=/home/user/anaconda3/bin/python
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'

This is what I did to use my Anaconda distribution with Spark. It is independent of the Spark version. You can change the first line to point at your user's Python binary. Also, as of Spark 2.2.0, PySpark is available as a standalone package on PyPI, but I have yet to test it.

I had the same problem.

Also make sure you are using the right Python version and installing with the matching pip. In my case I had both Python 2.7 and 3.x, and I installed pyspark with

pip2.7 install pyspark

and it worked.

I got this error because the Python script I was trying to submit was called pyspark.py (facepalm). The fix was to set my PYTHONPATH as recommended above, rename the script to pyspark_test.py, and delete the pyspark.pyc that had been created from my script's original name. That cleared the error up.
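
A quick way to check for this kind of name clash is to look for files in the script's directory that share the module's name. The helper below is a hypothetical sketch using only the standard library; it checks just the .py and .pyc names mentioned above.

```python
import os


def find_shadowing_files(module_name, directory):
    """List files in `directory` whose names would shadow `module_name` on import."""
    candidates = {module_name + ".py", module_name + ".pyc"}
    return sorted(entry for entry in os.listdir(directory) if entry in candidates)
```

Calling find_shadowing_files("pyspark", os.getcwd()) from the directory you launch the script in should return an empty list; anything else is worth renaming or deleting.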

In the case of DSE (DataStax Cassandra & Spark), the following location needs to be added to PYTHONPATH:

export PYTHONPATH=/usr/share/dse/resources/spark/python:$PYTHONPATH

Then use the dse pyspark to get the modules in path.

dse pyspark

I had this same problem and would add one thing to the proposed solutions above. When using Homebrew on Mac OS X to install Spark, you will need to correct the py4j path to include libexec (remembering to change the py4j version to the one you have):

PYTHONPATH=$SPARK_HOME/libexec/python/lib/py4j-0.9-src.zip:$PYTHONPATH

You can also create a Docker container with Alpine as the OS and then install Python and PySpark as packages. That keeps everything containerised.

If you installed pyspark with pip, you can find where it was installed with:

pip show pyspark

In my case it was being installed into a different Python's dist-packages (Python 3.5) whereas I was using Python 3.6, so the following helped:

python -m pip install pyspark
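
The reason python -m pip helps is that it ties pip to one specific interpreter. The same binding can be made explicit from within Python, as in this sketch; the function name pip_install_command is an assumption for illustration, and the command is only built and printed here, not executed.

```python
import sys


def pip_install_command(package):
    """Build a pip command bound to the interpreter that is running this code."""
    return [sys.executable, "-m", "pip", "install", package]


if __name__ == "__main__":
    # Shows which interpreter would receive the package.
    print(" ".join(pip_install_command("pyspark")))
```

Comparing the printed interpreter path with the one your shell or notebook actually uses is usually enough to spot a mismatch between Python versions.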

Reference URL: https://stackoverflow.com/questions/23256536/importing-pyspark-in-python-shell
