<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Spark submit - not reading one of my --py-files arguments in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/spark-submit-not-reading-one-of-my-py-files-arguments/m-p/62783#M32068</link>
    <description>Re: Spark submit - not reading one of my --py-files arguments in Data Engineering</description>
    <pubDate>Wed, 06 Mar 2024 19:36:36 GMT</pubDate>
    <dc:creator>MichTalebzadeh</dc:creator>
    <dc:date>2024-03-06T19:36:36Z</dc:date>
    <item>
      <title>Spark submit - not reading one of my --py-files arguments</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-submit-not-reading-one-of-my-py-files-arguments/m-p/62361#M31953</link>
      <description>&lt;P&gt;Hi. In Databricks workflows, I submit a Spark job (Type = "Spark Submit") with a bunch of parameters, starting with --py-files.&lt;/P&gt;&lt;P&gt;This works when all the files are in the same s3 path, but I get errors when I put a "common" module in a different s3 path:&lt;/P&gt;&lt;P&gt;"--py-files",&lt;BR /&gt;"s3://some_path/appl_src.py",&lt;BR /&gt;"s3://some_path/main.py",&lt;BR /&gt;"s3://a_different_path/common.py",&lt;/P&gt;&lt;P&gt;I get an error saying "common" doesn't exist, when I know for a fact that the path exists. From standard output:&lt;/P&gt;&lt;P&gt;Traceback (most recent call last):&lt;BR /&gt;File "/local_disk0/tmp/spark-123/appl_src.py", line 21, in &amp;lt;module&amp;gt;&lt;BR /&gt;from common import my_functions&lt;BR /&gt;ModuleNotFoundError: No module named 'common'&lt;/P&gt;&lt;P&gt;Additionally, log4j mentions the first two files, but not the third:&lt;/P&gt;&lt;P&gt;24/02/28 21:41:00 INFO Utils: Fetching s3://some_path/appl_src.py to ...&lt;BR /&gt;24/02/28 21:41:00 INFO Utils: Fetching s3://some_path/main.py to ...&lt;/P&gt;&lt;P&gt;Why does Spark ignore the third argument? Or does it have to be in the same s3 path?&lt;/P&gt;</description>
      <pubDate>Thu, 29 Feb 2024 15:02:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-submit-not-reading-one-of-my-py-files-arguments/m-p/62361#M31953</guid>
      <dc:creator>397973</dc:creator>
      <dc:date>2024-02-29T15:02:46Z</dc:date>
    </item>
    <item>
      <title>Re: Spark submit - not reading one of my --py-files arguments</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-submit-not-reading-one-of-my-py-files-arguments/m-p/62783#M32068</link>
      <description>&lt;DIV&gt;&lt;P&gt;The below is catered for YARN mode.&lt;/P&gt;&lt;P&gt;If your application code primarily consists of Python files &lt;STRONG&gt;and does not require a separate virtual environment with specific dependencies&lt;/STRONG&gt;, you can use the --py-files argument in spark-submit. Adjust the memory and executor settings as needed; note that an inline comment after a trailing backslash would break the line continuation, so only the last line carries one:&lt;/P&gt;&lt;PRE&gt;spark-submit --verbose \
  --master yarn \
  --deploy-mode cluster \
  --name $APPNAME \
  --driver-memory 1g \
  --executor-memory 1g \
  --num-executors 2 \
  &lt;STRONG&gt;--py-files ${build_directory}/source_code.zip \&lt;/STRONG&gt;
  $CODE_DIRECTORY_CLOUD/my_application_entry_point.py  # Path to your main application script&lt;/PRE&gt;
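&lt;P&gt;As a side note, a minimal sketch of how such a zip can be built so that imports resolve from it (the directory layout here is illustrative, not from this thread):&lt;/P&gt;&lt;PRE&gt;# Run from the project root so package paths are preserved inside the zip.
# Spark adds the zip to the PYTHONPATH, so top-level entries become importable,
# e.g. "from source_code.common import my_functions".
cd ${build_directory}
zip -rq source_code.zip source_code/&lt;/PRE&gt;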
&lt;P&gt;&lt;STRONG&gt;For application code with a separate virtual environment:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;If your application code has specific dependencies that you manage in a separate virtual environment, you can leverage the --conf spark.yarn.dist.archives argument:&lt;/P&gt;&lt;PRE&gt;spark-submit --verbose \
  --master yarn \
  --deploy-mode cluster \
  --name $APPNAME \
  --driver-memory 1g \
  --executor-memory 1g \
  --num-executors 2 \
  &lt;STRONG&gt;--conf spark.yarn.dist.archives=${pyspark_venv}.tar.gz#pyspark_venv \&lt;/STRONG&gt;
  $CODE_DIRECTORY_CLOUD/my_application_entry_point.py  # Path to your main application script&lt;/PRE&gt;&lt;P&gt;Explanation:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;--conf spark.yarn.dist.archives=${pyspark_venv}.tar.gz#pyspark_venv: this configures Spark to distribute your virtual environment archive (pyspark_venv.tar.gz) to the YARN cluster nodes. The #pyspark_venv part defines a symbolic link name within the container.&lt;/LI&gt;&lt;LI&gt;You do not need --py-files here because the virtual environment archive will contain all necessary dependencies.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Choosing the best approach depends on your project setup:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;No separate virtual environment: use --py-files if your application code consists mainly of Python files and doesn't require a separate virtual environment.&lt;/LI&gt;&lt;LI&gt;Separate virtual environment: use --conf spark.yarn.dist.archives if you manage dependencies in a separate virtual environment archive (see the sketch below).&lt;/LI&gt;&lt;/UL&gt;
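&lt;P&gt;For completeness, a minimal sketch of producing such an archive with venv-pack (the package names are illustrative; on the cluster you would typically also point the interpreter at the unpacked environment, e.g. --conf spark.pyspark.python=./pyspark_venv/bin/python):&lt;/P&gt;&lt;PRE&gt;# Build and populate the virtual environment locally
python -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install -r requirements.txt venv-pack

# Pack it into the tarball referenced by spark.yarn.dist.archives
venv-pack -o pyspark_venv.tar.gz&lt;/PRE&gt;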
&lt;P&gt;HTH&lt;/P&gt;&lt;P&gt;Mich Talebzadeh,&lt;/P&gt;&lt;P&gt;Dad | Technologist | Solutions Architect | Engineer&lt;/P&gt;&lt;P&gt;London&lt;/P&gt;&lt;P&gt;United Kingdom&lt;/P&gt;&lt;/DIV&gt;</description>
      <pubDate>Wed, 06 Mar 2024 19:36:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-submit-not-reading-one-of-my-py-files-arguments/m-p/62783#M32068</guid>
      <dc:creator>MichTalebzadeh</dc:creator>
      <dc:date>2024-03-06T19:36:36Z</dc:date>
    </item>
    <item>
      <title>Re: Spark submit - not reading one of my --py-files arguments</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-submit-not-reading-one-of-my-py-files-arguments/m-p/62785#M32069</link>
      <description>&lt;P&gt;OK, this one is for k8s on Google Cloud. However, you can adjust it for any cloud vendor.&lt;/P&gt;&lt;P&gt;I personally use a zip file and pass the application name (in your case main.py) as the last input line, as below.&lt;/P&gt;&lt;P&gt;APPLICATION is your main.py. It does not need to be called main.py; it could be anything, like testpython.py.&lt;/P&gt;&lt;PRE&gt;CODE_DIRECTORY_CLOUD="gs://spark-on-k8s/codes"   ## replace gs with s3
# zip needs to be done at root directory of code
zip -rq ${source_code}.zip ${source_code}
gsutil cp ${source_code}.zip $CODE_DIRECTORY_CLOUD   ## replace gsutil with aws s3
gsutil cp /${source_code}/src/${APPLICATION} $CODE_DIRECTORY_CLOUD&lt;/PRE&gt;&lt;P&gt;Your spark job:&lt;/P&gt;&lt;PRE&gt;spark-submit --verbose \
  --properties-file ${property_file} \
  --master k8s://https://$KUBERNETES_MASTER_IP:443 \
  --deploy-mode cluster \
  --name $APPNAME \
  &lt;STRONG&gt;--py-files $CODE_DIRECTORY_CLOUD/spark_on_gke.zip \&lt;/STRONG&gt;
  --conf spark.kubernetes.namespace=$NAMESPACE \
  --conf spark.network.timeout=300 \
  --conf spark.kubernetes.allocation.batch.size=3 \
  --conf spark.kubernetes.allocation.batch.delay=1 \
  --conf spark.kubernetes.driver.container.image=${IMAGEDRIVER} \
  --conf spark.kubernetes.executor.container.image=${IMAGEDRIVER} \
  --conf spark.kubernetes.driver.pod.name=$APPNAME \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-bq \
  --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
  --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.timeout=20s \
  --conf spark.dynamicAllocation.executorIdleTimeout=30s \
  --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=40s \
  --conf spark.dynamicAllocation.minExecutors=0 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  --conf spark.driver.cores=3 \
  --conf spark.executor.cores=3 \
  --conf spark.driver.memory=1024m \
  --conf spark.executor.memory=1024m \
  &lt;STRONG&gt;$CODE_DIRECTORY_CLOUD/${APPLICATION}&lt;/STRONG&gt;&lt;/PRE&gt;
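&lt;P&gt;Relating this back to the original error: one option worth trying is to fold the shared module into the same zip instead of listing it from a different bucket, so the import resolves from the archive. A minimal sketch (the local path and bundle name are illustrative, not from this thread):&lt;/P&gt;&lt;PRE&gt;# Put common.py at the zip root so "from common import my_functions"
# resolves once Spark adds the zip to the PYTHONPATH
cp /local/checkout/common/common.py .
zip -q app_bundle.zip common.py appl_src.py
# then submit with: --py-files $CODE_DIRECTORY_CLOUD/app_bundle.zip&lt;/PRE&gt;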
&lt;P&gt;HTH&lt;/P&gt;&lt;P&gt;Mich Talebzadeh,&lt;/P&gt;&lt;P&gt;Dad | Technologist | Solutions Architect | Engineer&lt;/P&gt;&lt;P&gt;London&lt;/P&gt;&lt;P&gt;United Kingdom&lt;/P&gt;</description>
      <pubDate>Wed, 06 Mar 2024 19:57:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-submit-not-reading-one-of-my-py-files-arguments/m-p/62785#M32069</guid>
      <dc:creator>MichTalebzadeh</dc:creator>
      <dc:date>2024-03-06T19:57:36Z</dc:date>
    </item>
  </channel>
</rss>

