Issue with Python Package Management in Spark application

Abhay_1002
New Contributor

In a PySpark application, I am using a set of Python libraries. To handle Python dependencies while running the PySpark application, I am using the approach provided by Spark:

  • Create an archive file of a Python virtual environment containing the required set of libraries (see the sketch below)
  • Pass this archive file with the --archives option in the spark-submit command
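For reference, the packaging step from that guide looks roughly like this with conda-pack (the environment and archive names are just the doc's examples):

# Create a conda environment containing the required libraries plus conda-pack itself
conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
conda activate pyspark_conda_env
# Pack the environment into a relocatable archive to pass to --archives
conda pack -f -o pyspark_conda_env.tar.gz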

Reference: https://spark.apache.org/docs/3.3.2/api/python/user_guide/python_packaging.html

As mentioned in the above document, the following environment variables need to be set before running the spark-submit command:
 
export PYSPARK_DRIVER_PYTHON=python # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_conda_env.tar.gz#environment app.py

But when I run the application on a Databricks cluster, I get the following error:
Path not found: ./environment/bin/python

I checked, and the archive file is extracted into different directories on the driver and executor nodes. There is no common working directory as such, so I cannot use a relative path for PYSPARK_PYTHON.

Following are the directories where the archive file is present on the different nodes:
driver : /local_disk0/spark-xxxxxx-xxxx-xxxx/userFiles-xxxxx-xxxxx-xxxxx/
executor : /local_disk0/spark-xxxxx-xxxxx-xxxxx/executor-xxxxxx-xxxxxx-xxxxxxxx/spark-xxxxx-xxxx-xxxxx-xxxx/

Please suggest whether this is the correct way, or if there is another way to handle Python dependencies for a Spark application.

Version : Databricks Runtime Version 12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12)

1 REPLY

NandiniN
Valued Contributor III

Hi,

I have not tried it, but based on the doc you have to go by this approach. ./environment/bin/python must be replaced with the correct path.

import os
from pyspark.sql import SparkSession

# Point the Python workers at the interpreter inside the unpacked archive.
os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
# Ship the packed conda env; '#environment' is the alias it is unpacked under.
spark = SparkSession.builder.config(
    "spark.archives",  # 'spark.yarn.dist.archives' in YARN.
    "pyspark_conda_env.tar.gz#environment").getOrCreate()

In the post below, one of the replies explains that the env_dir variable is set to the directory where the conda environment is unpacked by Spark; a rough sketch of that idea follows the link.

https://community.databricks.com/t5/machine-learning/how-to-use-spark-submit-python-task-with-the-us...
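For illustration, a minimal sketch of that idea (untested on Databricks, and assuming the #environment alias from spark.archives): SparkFiles.getRootDirectory() returns the local directory that files and archives distributed with the job are downloaded to on the driver, which should correspond to the userFiles-* path you listed.

import os
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.archives", "pyspark_conda_env.tar.gz#environment")
         .getOrCreate())

# Resolve the unpack location at runtime instead of hard-coding a relative
# path. This reflects the driver side; the executor-side layout is what you
# observed to differ.
env_dir = os.path.join(SparkFiles.getRootDirectory(), "environment")
python_bin = os.path.join(env_dir, "bin", "python")
print(python_bin)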