Issue with Python Package Management in Spark application

Abhay_1002
New Contributor

In a PySpark application, I am using a set of Python libraries. To handle Python dependencies while running the PySpark application, I am using the approach provided by Spark:

  • Create an archive file of a Python virtual environment with the required set of libraries
  • Pass this archive file with the --archives option in the spark-submit command

Reference : https://spark.apache.org/docs/3.3.2/api/python/user_guide/python_packaging.html

As mentioned in the above document, the following environment variables need to be set before running the spark-submit command:
 
export PYSPARK_DRIVER_PYTHON=python # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
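
For reference, the pyspark_conda_env.tar.gz archive used above is built roughly like this, following the conda-pack flow from the linked Spark packaging guide (the package list here is just a placeholder for the actual dependencies):

# Build a relocatable conda environment and pack it into a tar.gz archive
conda create -y -n pyspark_conda_env -c conda-forge python=3.9 numpy pandas conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz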

But when I run the application on a Databricks cluster, I get the following error:
Path not found : ./environment/bin/python 

I checked, and the archive file is extracted into different directories on the driver and executor nodes. There is no fixed working directory as such, so I cannot use a relative path for PYSPARK_PYTHON.

The following are the directories where the archive file is present on the different nodes:
driver : /local_disk0/spark-xxxxxx-xxxx-xxxx/userFiles-xxxxx-xxxxx-xxxxx/
executor : /local_disk0/spark-xxxxx-xxxxx-xxxxx/executor-xxxxxx-xxxxxx-xxxxxxxx/spark-xxxxx-xxxx-xxxxx-xxxx/
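
For what it is worth, this is roughly how the locations can be checked from inside the job (assuming the archive is unpacked under the SparkFiles root, which matches what I see on the driver but may not hold in general):

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Root directory Spark uses for distributed files and archives on the driver.
print("driver  :", SparkFiles.getRootDirectory())

# The same lookup on an executor, to compare the two locations.
print("executor:", spark.sparkContext.parallelize([0], 1)
      .map(lambda _: SparkFiles.getRootDirectory()).collect())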

Please suggest whether this is the correct way, or if there is another way to handle Python dependencies for a Spark application.

Version : Databricks Runtime Version 12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12)

1 REPLY

NandiniN
Databricks Employee

Hi,

I have not tried it, but based on the doc you have to go with this approach. ./environment/bin/python must be replaced with the correct path.

import os
from pyspark.sql import SparkSession

# Point the Python workers at the interpreter inside the unpacked archive.
os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"

# Ship the packed conda environment with the application.
spark = SparkSession.builder.config(
    "spark.archives",  # 'spark.yarn.dist.archives' in YARN.
    "pyspark_conda_env.tar.gz#environment").getOrCreate()

In the post below, one of the replies does explain it: the env_dir variable is set to the directory where the conda environment is unpacked by Spark.

https://community.databricks.com/t5/machine-learning/how-to-use-spark-submit-python-task-with-the-us...
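
A rough sketch of what that could look like here, resolving the unpacked location after the session starts instead of relying only on the relative path (untested on Databricks; it assumes the archive lands under the SparkFiles root, and env_dir is just the variable name used in that post):

import os
from pyspark import SparkFiles
from pyspark.sql import SparkSession

os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
spark = SparkSession.builder.config(
    "spark.archives",
    "pyspark_conda_env.tar.gz#environment").getOrCreate()

# After the session is up, resolve where the archive was actually unpacked
# and check that the interpreter exists there (assumption: it lands under
# the SparkFiles root; verify on your cluster before relying on this).
env_dir = os.path.join(SparkFiles.getRootDirectory(), "environment")
python_path = os.path.join(env_dir, "bin", "python")
print("env_dir:", env_dir)
print("interpreter exists:", os.path.exists(python_path))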
