<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: set PYTHONPATH when executing workflows in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/11837#M6752</link>
    <description>&lt;P&gt;I'm following the standard Python &lt;A href="https://docs.python.org/3/using/cmdline.html?highlight=pythonpath#envvar-PYTHONPATH" alt="https://docs.python.org/3/using/cmdline.html?highlight=pythonpath#envvar-PYTHONPATH" target="_blank"&gt;documentation&lt;/A&gt;. Databricks is compatible with Python, AFAIK.&lt;/P&gt;&lt;P&gt;This approach works when using "traditional" jobs, but not when using tasks in workflows.&lt;/P&gt;</description>
    <pubDate>Thu, 04 Aug 2022 05:39:08 GMT</pubDate>
    <dc:creator>FranPérez</dc:creator>
    <dc:date>2022-08-04T05:39:08Z</dc:date>
    <item>
      <title>set PYTHONPATH when executing workflows</title>
      <link>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/11835#M6750</link>
      <description>&lt;P&gt;I set up a workflow using 2 tasks. Just for demo purposes, I'm using an interactive cluster for running the workflow. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;            {
                "task_key": "prepare",
                "spark_python_task": {
                    "python_file": "file:/Workspace/Repos/devops/mlhub-mlops-dev/src/src/prepare_train.py",
                    "parameters": [
                        "/dbfs/raw",
                        "/dbfs/train",
                        "/dbfs/train"
                    ]
                },
                "existing_cluster_id": "XXXX-XXXXXX-XXXXXXXXX",
                "timeout_seconds": 0,
                "email_notifications": {}
            }&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;As stated in the documentation, I set the environment variable on the cluster. This is an excerpt of the cluster's JSON definition:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;  "spark_env_vars": {
    "PYSPARK_PYTHON": "/databricks/python3/bin/python3",
    "PYTHONPATH": "/Workspace/Repos/devops/mlhub-mlops-dev/src"
  }&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Then, when I run the Python task and log the contents of &lt;B&gt;sys.path&lt;/B&gt;, the path configured on the cluster is missing. If I log &lt;B&gt;os.getenv('PYTHONPATH')&lt;/B&gt;, I get nothing. It looks like the environment variables set at cluster level are not being propagated to the Python task.&lt;/P&gt;</description>
      <pubDate>Mon, 01 Aug 2022 07:37:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/11835#M6750</guid>
      <dc:creator>FranPérez</dc:creator>
      <dc:date>2022-08-01T07:37:10Z</dc:date>
    </item>
    <item>
      <title>Re: set PYTHONPATH when executing workflows</title>
      <link>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/11836#M6751</link>
      <description>&lt;P&gt;What documentation are you following here?&lt;/P&gt;&lt;P&gt;You shouldn't need to specify PYTHONPATH or PYSPARK_PYTHON, as this section is for Spark-specific environment variables such as "SPARK_WORKER_MEMORY".&lt;/P&gt;</description>
      <pubDate>Wed, 03 Aug 2022 19:56:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/11836#M6751</guid>
      <dc:creator>tomasz</dc:creator>
      <dc:date>2022-08-03T19:56:09Z</dc:date>
    </item>
    <item>
      <title>Re: set PYTHONPATH when executing workflows</title>
      <link>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/11837#M6752</link>
      <description>&lt;P&gt;I'm following the standard Python &lt;A href="https://docs.python.org/3/using/cmdline.html?highlight=pythonpath#envvar-PYTHONPATH" alt="https://docs.python.org/3/using/cmdline.html?highlight=pythonpath#envvar-PYTHONPATH" target="_blank"&gt;documentation&lt;/A&gt;. Databricks is compatible with Python, AFAIK.&lt;/P&gt;&lt;P&gt;This approach works when using "traditional" jobs, but not when using tasks in workflows.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Aug 2022 05:39:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/11837#M6752</guid>
      <dc:creator>FranPérez</dc:creator>
      <dc:date>2022-08-04T05:39:08Z</dc:date>
    </item>
    <item>
      <title>Re: set PYTHONPATH when executing workflows</title>
      <link>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/11838#M6753</link>
      <description>&lt;P&gt;Could you please try this instead?&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import sys
sys.path.append("/Workspace/Repos/devops/mlhub-mlops-dev/src")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;You need to call sys.path.append inside a UDF if the library needs to be available on the workers.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql.functions import udf

def move_libs_to_executors():
    # Executed inside the Python worker process that evaluates the UDF
    import sys
    sys.path.append("/Workspace/Repos/devops/mlhub-mlops-dev/src")

lib_udf = udf(move_libs_to_executors)

df = spark.range(100)
df.withColumn("lib", lib_udf()).show()&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 04 Aug 2022 05:48:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/11838#M6753</guid>
      <dc:creator>User16764241763</dc:creator>
      <dc:date>2022-08-04T05:48:11Z</dc:date>
    </item>
    <item>
      <title>Re: set PYTHONPATH when executing workflows</title>
      <link>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/11839#M6754</link>
      <description>&lt;P&gt;I'm already using this "fix", but it goes against good development practice because it hardcodes a file path in the code. The path should be provided via a parameter, which is why most solutions use environment variables for this: the path might change at deployment time.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;And as I mentioned before, following the Databricks documentation, you should be able to set environment variables using the &lt;B&gt;spark_env_vars&lt;/B&gt; section. Is there anything wrong with my initial approach?&lt;/P&gt;</description>
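A minimal sketch of the parameter-driven approach described in the post above, assuming the source directory is passed as the first entry of the task's parameters list or through an environment variable. The helper names and the PROJECT_SRC_PATH variable are hypothetical and do not appear in the thread; whether a custom variable set in spark_env_vars reaches a workflow task is not confirmed here.

```python
import os
import sys


def resolve_src_path(argv, env=None, env_var="PROJECT_SRC_PATH"):
    """Return the source directory from the first CLI argument, else from env_var, else None."""
    # PROJECT_SRC_PATH is a hypothetical variable name chosen for this sketch.
    env = os.environ if env is None else env
    if argv[1:]:
        return argv[1]
    return env.get(env_var)


def prepend_to_sys_path(path, sys_path=None):
    """Insert path at the front of sys_path unless it is already present."""
    target = sys.path if sys_path is None else sys_path
    if path not in target:
        target.insert(0, path)
```

In a task script, resolve_src_path(sys.argv) could be called first and, if it returns a value, passed to prepend_to_sys_path before any project imports. The code itself then contains no deployment-specific path.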
      <pubDate>Thu, 04 Aug 2022 06:25:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/11839#M6754</guid>
      <dc:creator>FranPérez</dc:creator>
      <dc:date>2022-08-04T06:25:52Z</dc:date>
    </item>
    <item>
      <title>Re: set PYTHONPATH when executing workflows</title>
      <link>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/11840#M6755</link>
      <description>&lt;P&gt;@Fran Pérez&amp;nbsp;I did a little research on this and found that PYTHONPATH is currently overwritten at cluster startup, and there is no way to redefine it at this time. For now, we recommend using the directories already on PYTHONPATH for your libraries, or just using &lt;A href="https://docs.databricks.com/libraries/index.html" alt="https://docs.databricks.com/libraries/index.html" target="_blank"&gt;user libraries&lt;/A&gt; for this.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;To see the PYTHONPATH that's set by default, you can run:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;%sh echo $PYTHONPATH&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;as a separate cell in a notebook that's attached to your cluster.&lt;/P&gt;</description>
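The same check can be made from Python inside the task itself, which is where the original post saw an empty value. A minimal sketch using only the standard library; the helper name pythonpath_entries is hypothetical.

```python
import os
import sys


def pythonpath_entries(env=None):
    """Split the PYTHONPATH value into a list of non-empty entries."""
    env = os.environ if env is None else env
    raw = env.get("PYTHONPATH", "")
    # os.pathsep is the platform's path-list separator (":" on Linux).
    return [entry for entry in raw.split(os.pathsep) if entry]


def missing_from_sys_path(entries, sys_path=None):
    """Return the entries that do not appear in sys_path."""
    target = sys.path if sys_path is None else sys_path
    return [entry for entry in entries if entry not in target]
```

Logging pythonpath_entries() together with missing_from_sys_path(pythonpath_entries()) at the start of a task shows which directories the interpreter actually received.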
      <pubDate>Fri, 05 Aug 2022 16:25:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/11840#M6755</guid>
      <dc:creator>tomasz</dc:creator>
      <dc:date>2022-08-05T16:25:33Z</dc:date>
    </item>
    <item>
      <title>Re: set PYTHONPATH when executing workflows</title>
      <link>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/11841#M6756</link>
      <description>&lt;P&gt;Hi @Fran Pérez,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Just a friendly follow-up. Did any of the responses help resolve your question? If so, please mark it as best. Otherwise, please let us know if you still need help.&lt;/P&gt;</description>
      <pubDate>Tue, 30 Aug 2022 17:07:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/11841#M6756</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2022-08-30T17:07:06Z</dc:date>
    </item>
    <item>
      <title>Re: set PYTHONPATH when executing workflows</title>
      <link>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/11842#M6757</link>
      <description>&lt;P&gt;This won't work for an editable library, because an editable install adds its path through site-packages via easy-install.pth.&lt;/P&gt;</description>
      <pubDate>Mon, 26 Dec 2022 01:52:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/11842#M6757</guid>
      <dc:creator>Cintendo</dc:creator>
      <dc:date>2022-12-26T01:52:09Z</dc:date>
    </item>
    <item>
      <title>Re: set PYTHONPATH when executing workflows</title>
      <link>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/120112#M46069</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/1574"&gt;@tomasz&lt;/a&gt;, I am not the original poster, but as it is now 3 years later I wanted to ask: is it still the case that PYTHONPATH cannot be modified from an init script in a way that won't be overwritten?&lt;/P&gt;&lt;P&gt;Is there a solution for putting a Workspace directory on Python's path aside from explicitly modifying `sys.path` in every executable notebook or Python script?&lt;/P&gt;&lt;P&gt;Would `pip install -e /Workspace/&amp;lt;etc&amp;gt;` work in an init script?&lt;/P&gt;&lt;P&gt;Thank you&lt;/P&gt;</description>
      <pubDate>Fri, 23 May 2025 20:01:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/120112#M46069</guid>
      <dc:creator>newenglander</dc:creator>
      <dc:date>2025-05-23T20:01:12Z</dc:date>
    </item>
    <item>
      <title>Re: set PYTHONPATH when executing workflows</title>
      <link>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/149852#M53183</link>
      <description>&lt;P&gt;Just checking in again: has a way to do this appeared in the last few years? As Fran mentioned, &lt;SPAN&gt;`sys.path.append("/Workspace/Repos/devops/mlhub-mlops-dev/src")` is not a great "fix" for the reasons already mentioned.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;I've found that you can do `pip install .`, but only on runtime 16.4 (and probably higher).&lt;/P&gt;&lt;P&gt;The other option is to set the working directory to the Repos/ location, but this is not ideal when working on a team, because then everyone would have to deploy to the same location and overwrite each other's work.&lt;/P&gt;&lt;P&gt;I am surprised there seems to be no way to simply append to the PYTHONPATH.&lt;/P&gt;</description>
      <pubDate>Wed, 04 Mar 2026 21:56:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/set-pythonpath-when-executing-workflows/m-p/149852#M53183</guid>
      <dc:creator>kenmyers-8451</dc:creator>
      <dc:date>2026-03-04T21:56:40Z</dc:date>
    </item>
  </channel>
</rss>

