Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Spark submit - not reading one of my --py-files arguments

397973
New Contributor III

Hi. In Databricks Workflows, I submit a Spark job (Type = "Spark Submit") with a bunch of parameters, starting with --py-files.

This works when all the files are in the same S3 path, but I get errors when I put a "common" module in a different S3 path:

"--py-files",
"s3://some_path/appl_src.py",
"s3://some_path/main.py",
"s3://a_different_path/common.py",

I get an error saying "common" doesn't exist, even though I know the path exists. From the standard output:

Traceback (most recent call last):
File "/local_disk0/tmp/spark-123/appl_src.py", line 21, in <module>
from common import my_functions
ModuleNotFoundError: No module named 'common'

Additionally, log4j mentions fetching the first two files, but not the third:

24/02/28 21:41:00 INFO Utils: Fetching s3://some_path/appl_src.py to ...
24/02/28 21:41:00 INFO Utils: Fetching s3://some_path/main.py to ...

Why does Spark ignore the third argument? Or does it have to be in the same S3 path?

2 REPLIES

MichTalebzadeh
Valued Contributor
What follows is catered for YARN mode.

If your application code primarily consists of Python files and does not require a separate virtual environment with specific dependencies, you can use the --py-files argument in spark-submit:

# Adjust memory and executor counts as needed
spark-submit --verbose \
  --master yarn \
  --deploy-mode cluster \
  --name $APPNAME \
  --driver-memory 1g \
  --executor-memory 1g \
  --num-executors 2 \
  --py-files ${build_directory}/source_code.zip \
  $CODE_DIRECTORY_CLOUD/my_application_entry_point.py   # path to your main application script
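As a side note that may be relevant to the original error: --py-files is a single argument that takes a comma-separated list of .py/.zip/.egg files, and the listed files can live under different S3 prefixes as long as they are joined into one value. A minimal sketch of that, with hypothetical paths and assuming the cluster is configured to read from S3:

# --py-files is one comma-separated list; the entry-point script is passed separately at the end
spark-submit --verbose \
  --master yarn \
  --deploy-mode cluster \
  --name $APPNAME \
  --py-files s3://some_path/appl_src.py,s3://a_different_path/common.py \
  s3://some_path/main.py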

For application code with a separate virtual environment:

If your application code has specific dependencies that you manage in a separate virtual environment, you can leverage the --conf spark.yarn.dist.archives argument.

# Adjust memory and executor counts as needed
spark-submit --verbose \
  --master yarn \
  --deploy-mode cluster \
  --name $APPNAME \
  --driver-memory 1g \
  --executor-memory 1g \
  --num-executors 2 \
  --conf spark.yarn.dist.archives=${pyspark_venv}.tar.gz#pyspark_venv \
  $CODE_DIRECTORY_CLOUD/my_application_entry_point.py   # path to your main application script

Explanation:

  • --conf spark.yarn.dist.archives=${pyspark_venv}.tar.gz#pyspark_venv: this configures Spark to distribute your virtual environment archive (pyspark_venv.tar.gz) to the YARN cluster nodes. The #pyspark_venv part defines the symbolic link name under which the archive is unpacked in each container.
  • You do not need --py-files here, because the virtual environment archive will contain all necessary dependencies (a sketch of building such an archive follows below).
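For completeness, here is a minimal sketch of how such an archive might be built and wired up, following the virtualenv packaging pattern from the Spark documentation. venv-pack is used here and pandas stands in for your real dependencies; names and paths are illustrative:

# build and pack a relocatable virtual environment
python3 -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install venv-pack pandas            # pandas is only a placeholder dependency
venv-pack -o pyspark_venv.tar.gz

# make driver and executors use the Python inside the unpacked #pyspark_venv symlink
export PYSPARK_PYTHON=./pyspark_venv/bin/python
spark-submit --verbose \
  --master yarn \
  --deploy-mode cluster \
  --name $APPNAME \
  --conf spark.yarn.dist.archives=${PWD}/pyspark_venv.tar.gz#pyspark_venv \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./pyspark_venv/bin/python \
  my_application_entry_point.py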

Choosing the best approach:

The choice depends on your project setup:

  • No Separate Virtual Environment: Use  --py-files if your application code consists mainly of Python files and doesn't require a separate virtual environment.
  • Separate Virtual Environment: Use --conf spark.yarn.dist.archives if you manage dependencies in a separate virtual environment archive.

HTH


Mich Talebzadeh
Technologist | Solutions Architect | Engineer
London, United Kingdom
https://en.everybodywiki.com/Mich_Talebzadeh
Disclaimer: The information provided is correct to the best of my knowledge but cannot be guaranteed. As with any advice, remember that "one test result is worth one-thousand expert opinions" (Wernher von Braun).

MichTalebzadeh
Valued Contributor

OK, this one is for k8s on Google Cloud; however, you can adjust it to any cloud vendor.

I personally use a zip file and pass the application name (in your case main.py) as the last input line, like below.

 

APPLICATION is your main.py. It does not need to be called main.py; it could be anything, like testpython.py.

 

CODE_DIRECTORY_CLOUD="gs://spark-on-k8s/codes"        ## replace gs with s3

# zip needs to be done at the root directory of the code
zip -rq ${source_code}.zip ${source_code}

# copy the zip and the entry-point script to cloud storage
gsutil cp ${source_code}.zip $CODE_DIRECTORY_CLOUD    ## replace gsutil with the aws s3 CLI
gsutil cp /${source_code}/src/${APPLICATION} $CODE_DIRECTORY_CLOUD
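For S3, the same packaging and copy steps would look roughly like this with the AWS CLI; the bucket path is hypothetical:

CODE_DIRECTORY_CLOUD="s3://my-bucket/codes"           ## hypothetical bucket/prefix
zip -rq ${source_code}.zip ${source_code}
aws s3 cp ${source_code}.zip ${CODE_DIRECTORY_CLOUD}/
aws s3 cp /${source_code}/src/${APPLICATION} ${CODE_DIRECTORY_CLOUD}/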

 

Your spark-submit job:

 

spark-submit --verbose \
    --properties-file ${property_file} \
    --master k8s://https://$KUBERNETES_MASTER_IP:443 \
    --deploy-mode cluster \
    --name $APPNAME \
    --py-files $CODE_DIRECTORY_CLOUD/spark_on_gke.zip \
    --conf spark.kubernetes.namespace=$NAMESPACE \
    --conf spark.network.timeout=300 \
    --conf spark.kubernetes.allocation.batch.size=3 \
    --conf spark.kubernetes.allocation.batch.delay=1 \
    --conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
    --conf spark.kubernetes.executor.container.image=${IMAGEDRIVER} \
    --conf spark.kubernetes.driver.pod.name=$APPNAME \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
    --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
    --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
    --conf spark.dynamicAllocation.shuffleTracking.timeout=20s \
    --conf spark.dynamicAllocation.executorIdleTimeout=30s \
    --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=40s \
    --conf spark.dynamicAllocation.minExecutors=0 \
    --conf spark.dynamicAllocation.maxExecutors=20 \
    --conf spark.driver.cores=3 \
    --conf spark.executor.cores=3 \
    --conf spark.driver.memory=1024m \
    --conf spark.executor.memory=1024m \
    $CODE_DIRECTORY_CLOUD/${APPLICATION}
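Tying this back to the original question: another option is to ship common.py inside the same zip as appl_src.py, so that a single --py-files entry covers both and "from common import my_functions" resolves from the zip on the PYTHONPATH. A minimal sketch, with a hypothetical local layout, shown with the YARN settings from the first example for brevity and assuming the cluster can read the S3 path:

# hypothetical local layout:
#   app/appl_src.py    (contains "from common import my_functions")
#   app/common.py
cd app
zip -rq app_bundle.zip appl_src.py common.py
aws s3 cp app_bundle.zip s3://some_path/

spark-submit --verbose \
  --master yarn \
  --deploy-mode cluster \
  --name $APPNAME \
  --py-files s3://some_path/app_bundle.zip \
  s3://some_path/main.py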

 

HTH

 

Mich Talebzadeh
Technologist | Solutions Architect | Engineer
London, United Kingdom
