Issue with Python Package Management in Spark application

Abhay_1002
New Contributor

In a PySpark application, I am using a set of Python libraries. To handle Python dependencies while running the PySpark application, I am using the approach provided by Spark:

  • Create an archive file of a Python virtual environment with the required set of libraries
  • Pass this archive file with the --archives option in the spark-submit command

Reference : https://spark.apache.org/docs/3.3.2/api/python/user_guide/python_packaging.html

As mentioned in the above document, the following environment variables need to be set before running the spark-submit command:
 
export PYSPARK_DRIVER_PYTHON=python # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
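
For reference, the pyspark_conda_env.tar.gz archive used above is built roughly like this, following the conda-pack flow from the linked Spark packaging guide (the package list here is just a placeholder for the actual dependencies):

# Build a relocatable conda environment and pack it into a tar.gz archive
conda create -y -n pyspark_conda_env -c conda-forge python=3.9 numpy pandas conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz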

But when I run the application on a Databricks cluster, I get the following error:
Path not found : ./environment/bin/python 

I checked, and the archive file is extracted into different directories on the driver and executor nodes. There is no fixed working directory as such, so I cannot use a relative path for PYSPARK_PYTHON.

The following are the directories where the archive file is present on the different nodes:
driver : /local_disk0/spark-xxxxxx-xxxx-xxxx/userFiles-xxxxx-xxxxx-xxxxx/
executor : /local_disk0/spark-xxxxx-xxxxx-xxxxx/executor-xxxxxx-xxxxxx-xxxxxxxx/spark-xxxxx-xxxx-xxxxx-xxxx/
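
For what it is worth, this is roughly how the locations can be checked from inside the job (assuming the archive is unpacked under the SparkFiles root, which matches what I see on the driver but may not hold in general):

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Root directory Spark uses for distributed files and archives on the driver.
print("driver  :", SparkFiles.getRootDirectory())

# The same lookup on an executor, to compare the two locations.
print("executor:", spark.sparkContext.parallelize([0], 1)
      .map(lambda _: SparkFiles.getRootDirectory()).collect())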

Please suggest whether this is the correct way, or if there is another way to handle Python dependencies for a Spark application.

Version : Databricks Runtime Version 12.2 LTS (includes Apache Spark 3.3.2, Scala 2.12)

1 REPLY

NandiniN
Databricks Employee

Hi,

I have not tried it, but based on the doc you have to go with this approach. ./environment/bin/python must be replaced with the correct path.

import os
from pyspark.sql import SparkSession

# Point the Python workers at the interpreter inside the unpacked archive.
os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"

# Ship the packed conda environment with the application.
spark = SparkSession.builder.config(
    "spark.archives",  # 'spark.yarn.dist.archives' in YARN.
    "pyspark_conda_env.tar.gz#environment").getOrCreate()

In the post below, one of the replies does explain it: the env_dir variable is set to the directory where the conda environment is unpacked by Spark.

https://community.databricks.com/t5/machine-learning/how-to-use-spark-submit-python-task-with-the-us...
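
A rough sketch of what that could look like here, resolving the unpacked location after the session starts instead of relying only on the relative path (untested on Databricks; it assumes the archive lands under the SparkFiles root, and env_dir is just the variable name used in that post):

import os
from pyspark import SparkFiles
from pyspark.sql import SparkSession

os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
spark = SparkSession.builder.config(
    "spark.archives",
    "pyspark_conda_env.tar.gz#environment").getOrCreate()

# After the session is up, resolve where the archive was actually unpacked
# and check that the interpreter exists there (assumption: it lands under
# the SparkFiles root; verify on your cluster before relying on this).
env_dir = os.path.join(SparkFiles.getRootDirectory(), "environment")
python_path = os.path.join(env_dir, "bin", "python")
print("env_dir:", env_dir)
print("interpreter exists:", os.path.exists(python_path))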
