Spark submit - not reading one of my --py-files arguments

397973
New Contributor III

Hi. In Databricks Workflows, I submit a Spark job (Type = "Spark Submit") with a bunch of parameters, starting with --py-files.

This works when all the files are in the same s3 path, but I get errors when I put a "common" module in a different s3 path:

"--py-files",
"s3://some_path/appl_src.py",
"s3://some_path/main.py",
"s3://a_different_path/common.py",

I get an error saying "common" doesn't exist, even though I know the path exists. From standard output:

Traceback (most recent call last):
File "/local_disk0/tmp/spark-123/appl_src.py", line 21, in <module>
from common import my_functions
ModuleNotFoundError: No module named 'common'

Additionally, log4j mentions the first two files, but not the third:

24/02/28 21:41:00 INFO Utils: Fetching s3://some_path/appl_src.py to ...
24/02/28 21:41:00 INFO Utils: Fetching s3://some_path/main.py to ...

Why does Spark ignore the third argument? Or does it have to be in the same s3 path?

2 REPLIES

MichTalebzadeh
Contributor
The example below is for YARN mode.

If your application code primarily consists of Python files and does not require a separate virtual environment with specific dependencies, you can use the --py-files argument in spark-submit:

# Adjust memory and executor counts as needed.
spark-submit --verbose \
  --master yarn \
  --deploy-mode cluster \
  --name $APPNAME \
  --driver-memory 1g \
  --executor-memory 1g \
  --num-executors 2 \
  --py-files ${build_directory}/source_code.zip \
  $CODE_DIRECTORY_CLOUD/my_application_entry_point.py   # path to your main application script
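
Note that --py-files takes a single comma-separated list of .zip, .egg, or .py files; if you pass individual .py files rather than a zip, they must all appear in that one value (they can live under different s3 prefixes). A minimal sketch, assuming hypothetical paths similar to yours:

# All dependency files in one comma-separated --py-files value;
# the main script is passed separately as the application file.
spark-submit --verbose \
  --master yarn \
  --deploy-mode cluster \
  --name $APPNAME \
  --py-files "s3://some_path/appl_src.py,s3://a_different_path/common.py" \
  s3://some_path/main.py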

For application code with a separate virtual environment:

If your application code has specific dependencies that you manage in a separate virtual environment, you can leverage the spark.yarn.dist.archives configuration (passed via --conf).

# Adjust memory and executor counts as needed.
spark-submit --verbose \
  --master yarn \
  --deploy-mode cluster \
  --name $APPNAME \
  --driver-memory 1g \
  --executor-memory 1g \
  --num-executors 2 \
  --conf spark.yarn.dist.archives=${pyspark_venv}.tar.gz#pyspark_venv \
  $CODE_DIRECTORY_CLOUD/my_application_entry_point.py   # path to your main application script

Explanation:

  • --conf spark.yarn.dist.archives=${pyspark_venv}.tar.gz#pyspark_venv: This configures Spark to distribute your virtual environment archive (pyspark_venv.tar.gz) to the YARN cluster nodes. The #pyspark_venv part defines a symbolic link name within the container; see the sketch after this list for one way to build such an archive.
  • You do not need --py-files here, because the virtual environment archive will contain all necessary dependencies.
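
As a rough sketch of how such an archive can be built (assuming the venv-pack tool; the dependency names here are only illustrative):

# Create and pack the virtual environment (illustrative dependencies).
python3 -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install venv-pack pyspark pandas
venv-pack -o pyspark_venv.tar.gz   # this is the archive passed via spark.yarn.dist.archives

In YARN cluster mode you would typically also point the Python workers at the unpacked environment, for example with --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./pyspark_venv/bin/python.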

Choosing the best approach:

The choice depends on your project setup:

  • No Separate Virtual Environment: Use --py-files if your application code consists mainly of Python files and doesn't require a separate virtual environment.
  • Separate Virtual Environment: Use --conf spark.yarn.dist.archives if you manage dependencies in a separate virtual environment archive.

HTH


Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London, United Kingdom

view my Linkedin profile
https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Werner Von Braun).

OK, this one is for Kubernetes (k8s) on Google Cloud; however, you can adjust it for any cloud vendor.

I personally use a zip file and pass the application name (in your case main.py) as the last input line, as below.

APPLICATION is your main script. It does not need to be called main.py; it could be anything, like testpython.py.

 

CODE_DIRECTORY_CLOUD="gs://spark-on-k8s/codes"      ## replace gs:// with s3:// on AWS
# zip needs to be done at the root directory of the code
zip -rq ${source_code}.zip ${source_code}
gsutil cp ${source_code}.zip $CODE_DIRECTORY_CLOUD        ## replace gsutil with "aws s3" on AWS
gsutil cp ${source_code}/src/${APPLICATION} $CODE_DIRECTORY_CLOUD
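
For AWS, as the comments above suggest, the equivalent copy steps might look like this (a sketch assuming the AWS CLI is installed and configured; the bucket name is illustrative):

# Illustrative S3 bucket/prefix; substitute your own.
CODE_DIRECTORY_CLOUD="s3://my-spark-code-bucket/codes"
zip -rq ${source_code}.zip ${source_code}
aws s3 cp ${source_code}.zip $CODE_DIRECTORY_CLOUD/
aws s3 cp ${source_code}/src/${APPLICATION} $CODE_DIRECTORY_CLOUD/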

 

Your spark-submit job:

 

spark-submit --verbose \
    --properties-file ${property_file} \
    --master k8s://https://$KUBERNETES_MASTER_IP:443 \
    --deploy-mode cluster \
    --name $APPNAME \
    --py-files $CODE_DIRECTORY_CLOUD/spark_on_gke.zip \
    --conf spark.kubernetes.namespace=$NAMESPACE \
    --conf spark.network.timeout=300 \
    --conf spark.kubernetes.allocation.batch.size=3 \
    --conf spark.kubernetes.allocation.batch.delay=1 \
    --conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
    --conf spark.kubernetes.executor.container.image=${IMAGEDRIVER} \
    --conf spark.kubernetes.driver.pod.name=$APPNAME \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
    --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
    --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
    --conf spark.dynamicAllocation.shuffleTracking.timeout=20s \
    --conf spark.dynamicAllocation.executorIdleTimeout=30s \
    --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=40s \
    --conf spark.dynamicAllocation.minExecutors=0 \
    --conf spark.dynamicAllocation.maxExecutors=20 \
    --conf spark.driver.cores=3 \
    --conf spark.executor.cores=3 \
    --conf spark.driver.memory=1024m \
    --conf spark.executor.memory=1024m \
    $CODE_DIRECTORY_CLOUD/${APPLICATION}

 

HTH

 

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London, United Kingdom