08-01-2022 12:37 AM
I set up a workflow using 2 tasks. Just for demo purposes, I'm using an interactive cluster for running the workflow.
{
  "task_key": "prepare",
  "spark_python_task": {
    "python_file": "file:/Workspace/Repos/devops/mlhub-mlops-dev/src/src/prepare_train.py",
    "parameters": [
      "/dbfs/raw",
      "/dbfs/train",
      "/dbfs/train"
    ]
  },
  "existing_cluster_id": "XXXX-XXXXXX-XXXXXXXXX",
  "timeout_seconds": 0,
  "email_notifications": {}
}
As stated in the documentation, I set up the environment variables on the cluster ... this is an excerpt of the JSON definition of the cluster:
"spark_env_vars": {
"PYSPARK_PYTHON": "/databricks/python3/bin/python3",
"PYTHONPATH": "/Workspace/Repos/devops/mlhub-mlops-dev/src"
}
Then, when I execute the Python task and log the contents of sys.path, I can't find the path configured on the cluster. If I log the contents of os.getenv('PYTHONPATH'), I get nothing. It looks like the environment variables set at the cluster level are not being propagated to the Python task.
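For reference, the check inside prepare_train.py looks roughly like this (a minimal sketch of the logging only; the real script does more):

# prepare_train.py -- minimal diagnostic sketch (illustrative; the real
# training logic is omitted).
import os
import sys

if __name__ == "__main__":
    # Log every entry of the interpreter search path seen by the task.
    for entry in sys.path:
        print(f"sys.path entry: {entry}")

    # Log the PYTHONPATH environment variable as the task process sees it.
    print(f"PYTHONPATH = {os.getenv('PYTHONPATH')!r}")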
08-03-2022 12:56 PM
What documentation are you following here?
You shouldn't need to specify PYTHONPATH or PYSPARK_PYTHON, as this section is for Spark-specific environment variables such as "SPARK_WORKER_MEMORY".
08-03-2022 10:39 PM
I'm following the standard Python documentation ... Databricks is compatible with Python, AFAIK.
This approach works when using "traditional" jobs, but not when using tasks in workflows.
08-03-2022 10:48 PM
Could you please try this instead?
import sys
sys.path.append("/Workspace/Repos/devops/mlhub-mlops-dev/src")
You need to do sys.path.append inside the UDF if the library needs to be available on the workers.
from pyspark.sql.functions import udf

def move_libs_to_executors():
    # Runs on the executors: make the repo importable in each Python worker.
    import sys
    sys.path.append("/Workspace/Repos/devops/mlhub-mlops-dev/src")

lib_udf = udf(move_libs_to_executors)

# Apply the UDF to any DataFrame so the path gets appended on every executor.
df = spark.range(100)
df.withColumn("lib", lib_udf()).show()
08-03-2022 11:25 PM
I'm already using this "fix", but it goes against good development practices because you are hardcoding a file path in your code. The path should be provided via a parameter; this is exactly what environment variables are used for in most solutions, because the path might change at deployment time.
And as I mentioned before, following the Databricks documentation, you should be able to set environment variables using the spark_env_vars section. Is there anything wrong with my initial approach?
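For example, instead of hardcoding the path, I'd expect to be able to pass it in as one more task parameter and pick it up in the script, roughly like this (a sketch only; the argument names and the extra fourth parameter are illustrative, not my current setup):

# prepare_train.py -- sketch of receiving the repo path as an extra task
# parameter instead of hardcoding it (argument names are illustrative).
import sys

if __name__ == "__main__":
    # Existing positional parameters from the task definition.
    raw_path, train_path, output_path = sys.argv[1:4]

    # Deployment-specific repo path passed as a fourth parameter.
    repo_src_path = sys.argv[4]
    sys.path.append(repo_src_path)

    # Project-level imports only succeed after the path is appended.
    # import my_project  # hypothetical module living under repo_src_path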
08-05-2022 09:25 AM
@Fran Pérez I did a little research on this and found that currently PYTHONPATH is overwritten at cluster startup and there is no way to redefine it at this time. For now we would recommend placing your libraries in the already defined PYTHONPATH directories, or simply installing them as user libraries.
To see the PYTHONPATH that's set by default you can run:
%sh echo $PYTHONPATH
as a separate cell in a notebook that's attached to your cluster.
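If you prefer to check it from Python instead (for example from inside the job script), the equivalent is a trivial sketch like:

# Print the PYTHONPATH the process actually sees, plus the effective sys.path.
import os
import sys

print(os.environ.get("PYTHONPATH", "<not set>"))
print("\n".join(sys.path))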
12-25-2022 05:52 PM
This won't work for an editable library, since an editable install appends its path via the easy-install.pth file in site-packages rather than through PYTHONPATH.
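One possible workaround for that case (a hedged sketch, assuming you know which directory holds the easy-install.pth produced by the editable install) is to register that directory with site.addsitedir, which, unlike sys.path.append, does process .pth files:

import site

# Hypothetical directory holding the easy-install.pth written by the
# editable install (`pip install -e .`). addsitedir() reads .pth files
# and appends the paths listed inside them; sys.path.append() does not.
site.addsitedir("/Workspace/Repos/devops/mlhub-mlops-dev/.venv/lib/python3.9/site-packages")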
08-30-2022 10:07 AM
Hi @Fran Pérez,
Just a friendly follow-up. Did any of the responses help you resolve your question? If so, please mark it as best. Otherwise, please let us know if you still need help.