In a PySpark application, I am using a set of Python libraries. To handle the Python dependencies when running the PySpark application, I am using the approach provided by Spark:
- Create an archive of a Python virtual environment that contains the required libraries (a sketch of this step is included below)
- Pass this archive to spark-submit with the --archives option
Reference : https://spark.apache.org/docs/3.3.2/api/python/user_guide/python_packaging.html
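
For reference, the archive is built roughly like this (a minimal sketch using the conda-pack Python API described in the linked guide; the environment name pyspark_conda_env is an assumption that matches the archive name used below):

import conda_pack

# Pack an existing conda environment (assumed to be named "pyspark_conda_env"
# and to contain the required libraries) into the archive that is later
# passed to spark-submit via --archives.
conda_pack.pack(
    name="pyspark_conda_env",
    output="pyspark_conda_env.tar.gz",
)

The linked guide shows the equivalent conda pack command-line invocation.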
As mentioned in the above document, the following environment variables need to be set before running the spark-submit command:
export PYSPARK_DRIVER_PYTHON=python # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
But when I run the application on a Databricks cluster, I get the following error:
Path not found : ./environment/bin/python
I checked, and the archive is present in different directories on the driver and executor nodes. There is no common (classpath-style) directory as such, so I cannot use a relative path for PYSPARK_PYTHON.
The following are the directories where the archive is present on the different nodes (a sketch of how these locations can be confirmed is included after the paths):
driver : /local_disk0/spark-xxxxxx-xxxx-xxxx/userFiles-xxxxx-xxxxx-xxxxx/
executor : /local_disk0/spark-xxxxx-xxxxx-xxxxx/executor-xxxxxx-xxxxxx-xxxxxxxx/spark-xxxxx-xxxx-xxxxx-xxxx/
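
For context, these locations can be confirmed from inside the job roughly like this (a sketch using pyspark.SparkFiles, not the exact code I ran):

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Directory where Spark places distributed files/archives on the driver.
print("driver  :", SparkFiles.getRootDirectory())

# The same call executed inside tasks reports the per-executor location.
executor_dirs = (
    spark.sparkContext
    .parallelize(range(8), 8)
    .map(lambda _: SparkFiles.getRootDirectory())
    .distinct()
    .collect()
)
print("executor:", executor_dirs)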
Please suggest whether this is the correct way, or if there is another way to handle Python dependencies for a Spark application.
Version: Databricks Runtime 12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12)