
Issue with Python Package Management in Spark application

Abhay_1002
New Contributor

In a PySpark application, I am using a set of Python libraries. To handle the Python dependencies while running the PySpark application, I am using the approach provided by Spark:

  • Create an archive file of a Python virtual environment containing the required set of libraries
  • Pass this archive file with the --archives option in the spark-submit command

Reference : https://spark.apache.org/docs/3.3.2/api/python/user_guide/python_packaging.html

As mentioned in the above document, the following environment variables need to be set before running the spark-submit command:
 
export PYSPARK_DRIVER_PYTHON=python # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_conda_env.tar.gz#environment app.py

But when I run the application on a Databricks cluster, I get the following error:
Path not found : ./environment/bin/python 

I checked, and the archive file is unpacked into different directories on the driver and executor nodes. There is no common classpath/working directory as such, so I cannot use a relative path for PYSPARK_PYTHON.

The following are the directories where the archive file is present on the different nodes (a quick way to confirm these locations is sketched after the listing):
driver : /local_disk0/spark-xxxxxx-xxxx-xxxx/userFiles-xxxxx-xxxxx-xxxxx/
executor : /local_disk0/spark-xxxxx-xxxxx-xxxxx/executor-xxxxxx-xxxxxx-xxxxxxxx/spark-xxxxx-xxxx-xxxxx-xxxx/
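
These directories can be confirmed with a quick check like the one below (a rough sketch; it assumes the archive was distributed via --archives / spark.archives, and prints where Spark places distributed files on the driver and on one executor):

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Root directory for files/archives distributed to the driver
print("driver   :", SparkFiles.getRootDirectory())

# The same lookup, executed inside a task on an executor
executor_root = (spark.sparkContext
                 .parallelize([0], 1)
                 .map(lambda _: SparkFiles.getRootDirectory())
                 .collect()[0])
print("executor :", executor_root)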

Please suggest whether this is the correct way, or if there is any other way to handle Python dependencies for a Spark application.

Version : Databricks Runtime Version 12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12)

1 REPLY

NandiniN
Honored Contributor

Hi,

I have not tried it, but based on the doc you have to go with this approach. ./environment/bin/python must be replaced with the correct path.

import os
from pyspark.sql import SparkSession

os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
spark = SparkSession.builder.config(
    "spark.archives",  # 'spark.yarn.dist.archives' in YARN.
    "pyspark_conda_env.tar.gz#environment").getOrCreate() 

In the post linked below, one of the replies explains that the env_dir variable is set to the directory where the conda environment is unpacked by Spark.

https://community.databricks.com/t5/machine-learning/how-to-use-spark-submit-python-task-with-the-us...
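
A rough sketch of what that lookup could look like here (it assumes the archive name and the #environment fragment from the examples above; the driver-side unpack location is based on the userFiles-... path observed in the question, and the executor-side path may still need separate handling):

import os
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.config(
    "spark.archives",  # 'spark.yarn.dist.archives' in YARN.
    "pyspark_conda_env.tar.gz#environment").getOrCreate()

# On the driver, the unpacked archive shows up under the SparkFiles root
# (the userFiles-... directory mentioned in the question), in a folder
# named after the #environment fragment.
env_dir = os.path.join(SparkFiles.getRootDirectory(), "environment")

# Use the absolute path instead of the relative ./environment/bin/python.
# Executors unpack the archive elsewhere, so the executor-side interpreter
# path may still differ and need its own handling.
os.environ['PYSPARK_PYTHON'] = os.path.join(env_dir, "bin", "python")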
