Databricks Community

Abhay_1002 · ‎05-01-2024

In a PySpark application, I am using a set of Python libraries. To handle the Python dependencies when running the application, I am using the approach recommended by Spark:

Create an archive of a Python virtual environment that contains the required libraries
Pass this archive to the spark-submit command with the --archives option

Reference : https://spark.apache.org/docs/3.3.2/api/python/user_guide/python_packaging.html

As mentioned in the document above, the following environment variables need to be set before running the spark-submit command:

export PYSPARK_DRIVER_PYTHON=python # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_conda_env.tar.gz#environment app.py

But when I run the application on a Databricks cluster, I get the following error:
Path not found : ./environment/bin/python

I checked, and the archive file is present in different directories on the driver and executor nodes. There is no classpath directory as such, so I cannot use a relative path for PYSPARK_PYTHON.

These are the directories where the archive file is present on each node:
driver : /local_disk0/spark-xxxxxx-xxxx-xxxx/userFiles-xxxxx-xxxxx-xxxxx/
executor : /local_disk0/spark-xxxxx-xxxxx-xxxxx/executor-xxxxxx-xxxxxx-xxxxxxxx/spark-xxxxx-xxxx-xxxxx-xxxx/
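One way to confirm where the relative path actually resolves on each node is to probe it from the driver and from a few executor tasks. The following is only a minimal sketch: probe_interpreter and probe_cluster are hypothetical helper names, and it assumes the cluster is started with its default Python (PYSPARK_PYTHON not yet overridden) so that the executors can launch at all.

```python
import os
import socket


def probe_interpreter(relative_path="./environment/bin/python"):
    """Report where a relative interpreter path resolves in this process."""
    resolved = os.path.abspath(relative_path)
    return {
        "host": socket.gethostname(),
        "cwd": os.getcwd(),
        "resolved": resolved,
        "exists": os.path.exists(resolved),
    }


def probe_cluster(spark, relative_path="./environment/bin/python", tasks=4):
    """Run the probe on the driver and inside a few executor tasks.

    `spark` is an existing SparkSession. When this code lives in an
    importable module rather than a notebook or the main script, that
    module must also be available on the executors.
    """
    driver = probe_interpreter(relative_path)
    executors = (
        spark.sparkContext
        .parallelize(range(tasks), tasks)
        .map(lambda _: probe_interpreter(relative_path))
        .collect()
    )
    return driver, executors
```

Calling probe_cluster(spark) returns one dictionary for the driver and a list for the executors. Comparing the "cwd" and "exists" fields shows whether the working directory of each process matches the directory that holds the unpacked archive.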

Please suggest whether this is the correct approach, or whether there is another way to handle Python dependencies for a Spark application.

Version : Databricks Runtime Version 12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12)

NandiniN · ‎05-01-2024

Hi,

I have not tried it, but based on the doc this is the approach to follow. ./environment/bin/python must be replaced with the correct path.

import os
from pyspark.sql import SparkSession

os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
spark = SparkSession.builder.config(
    "spark.archives",  # 'spark.yarn.dist.archives' in YARN.
    "pyspark_conda_env.tar.gz#environment").getOrCreate()

In the post linked below, one of the replies explains that the env_dir variable is set to the directory where Spark unpacks the conda environment.

https://community.databricks.com/t5/machine-learning/how-to-use-spark-submit-python-task-with-the-us...
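To illustrate the env_dir idea, here is a rough sketch of how an absolute interpreter path could be derived from a directory instead of relying on a relative one. It is not the code from the linked reply. The helper names are hypothetical, and treating SparkFiles.getRootDirectory() as a candidate location is an assumption based on the userFiles-... driver path shown in the question; it has not been verified on Databricks Runtime 12.2 LTS.

```python
import os


def interpreter_in(env_dir, alias="environment"):
    """Build the path of the interpreter inside an unpacked archive.

    `alias` is the name after '#' in the --archives / spark.archives value.
    """
    return os.path.join(env_dir, alias, "bin", "python")


def find_interpreter(candidate_dirs, alias="environment"):
    """Return the first interpreter path that exists, or None."""
    for env_dir in candidate_dirs:
        path = interpreter_in(env_dir, alias)
        if os.path.isfile(path):
            return path
    return None


def spark_candidate_dirs():
    """Directories where Spark may have unpacked the archive (assumption)."""
    # Imported lazily: this needs pyspark and an active SparkContext.
    from pyspark import SparkFiles
    return [SparkFiles.getRootDirectory(), os.getcwd()]
```

With an active session, find_interpreter(spark_candidate_dirs()) returns an absolute path if the archive was unpacked in one of those directories, and None otherwise. Note that PYSPARK_PYTHON is read when Python workers are launched, so changing it after the executors have started is unlikely to take effect. Because the driver and executor directories differ, a path found on the driver will not necessarily be valid on the executors.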

Issue with Python Package Management in Spark application
