Data Engineering
Difference between running PySpark code with the command python3 and with pyspark

twotwoiscute
New Contributor

I am confused about the difference between running code with the command

python3 CODENAME.py

and launching it with the command

pyspark

and then working on the code interactively.

When I run this code:

spark = SparkSession.builder.config("spark.driver.memory", "16g").appName("EDA").getOrCreate()

The first way

python3 CODENAME.py
raises an error even though I have already set:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_HOME=/home/twotwo/anaconda3/envs/yolov5/lib/python3.8/site-packages/pyspark
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
export PATH=$SPARK_HOME/python:$PATH

The error message looks like:

Exception: Java gateway process exited before sending its port number

However, the second way runs the code without any problem. I would like to know what the difference is between these two ways. Thanks!
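For context, a minimal sketch of the environment setup the pyspark launcher performs that plain python3 does not (paths are copied from the exports above; the py4j zip name varies by Spark release, so adjust as needed):

```python
# What the `pyspark` launcher sets up before starting Python, done by hand
# so the same script also works under plain `python3`. Run this *before*
# importing pyspark. Paths match the exports above; adjust to your machine.
import os
import sys

SPARK_HOME = "/home/twotwo/anaconda3/envs/yolov5/lib/python3.8/site-packages/pyspark"

os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-8-openjdk-amd64")
os.environ.setdefault("SPARK_HOME", SPARK_HOME)

# Make Spark's Python sources and the bundled py4j bridge importable:
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib", "py4j-0.10.9-src.zip"))
```

If the "Java gateway process exited" error persists after this, it usually means JAVA_HOME points at a missing or incompatible JDK. The findspark package (findspark.init()) automates the same path setup.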

2 REPLIES

Kaniz
Community Manager

Hi @twotwoiscute! My name is Kaniz, and I'm a technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers on the forum have an answer first; otherwise, I will follow up shortly with a response.

Kaniz
Community Manager

You can run Spark code with either the spark-submit or the pyspark command. Both are available in the $SPARK_HOME/bin directory, and you will find two sets of them: shell scripts for Linux/macOS and .cmd files for Windows.
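For illustration, here is where the two launchers live and what each is for (the /opt/spark fallback path is only an assumption; use your own SPARK_HOME):

```shell
# Both launchers live under $SPARK_HOME/bin; on Windows the same names
# exist with a .cmd suffix. /opt/spark below is only an illustrative fallback.
SPARK_BIN="${SPARK_HOME:-/opt/spark}/bin"

echo "$SPARK_BIN/pyspark"              # interactive REPL, `spark` predefined
echo "$SPARK_BIN/spark-submit app.py"  # batch submission of a script
```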

If you are using EMR, there are three ways to launch an application:

1. using pyspark (or spark-shell)

2. using spark-submit without --master and --deploy-mode

3. using spark-submit with --master and --deploy-mode

Although all three will run the application on the Spark cluster, they differ in how the driver program runs.

With the 1st and 2nd, the driver runs in client mode (on the machine you launch from), whereas with the 3rd the driver also runs inside the cluster.

With the 1st and 2nd, you have to wait for one application to complete before running another, but with the 3rd you can run multiple applications in parallel.
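The three launch styles above might look like this on an EMR master node (CODENAME.py and the YARN master are illustrative; the commands are held in strings so nothing is actually submitted here):

```shell
# 1) Interactive shell -- the driver runs where you type (client mode):
#      pyspark
# 2) Default spark-submit -- still client mode on EMR:
SUBMIT_CLIENT='spark-submit CODENAME.py'
# 3) Explicit cluster mode -- the driver runs on a cluster node, so the
#    submitting shell is free to launch further applications in parallel:
SUBMIT_CLUSTER='spark-submit --master yarn --deploy-mode cluster CODENAME.py'

echo "$SUBMIT_CLIENT"
echo "$SUBMIT_CLUSTER"
```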
