Data Engineering

Permanently add python file path to sys.path in Databricks

Direo
Contributor

If your notebook is in a different directory or subdirectory than the Python module, you cannot import the module until you add its location to the Python path.

That means that even though all users are using the same module, since they are all working from different repos, none of them can import it until they add the path.

I wonder whether it is possible to add the module's file path to sys.path in Databricks permanently, or at least until the file is deleted.

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

You can import it directly inside the same repo: provide the whole path from the highest repo level in any notebook inside that repo. As you mentioned, if the file is in another repo, you need to use sys.path.append. To make that permanent, you can try editing the global init scripts.

from directory.sub_directory.my_file import MyClass

"""
Repo
└── directory
    └── sub_directory
        └── my_file.py
"""


REPLIES


Kaniz_Fatma
Community Manager

Hi @Direo, we haven't heard from you since the last response from @Hubert Dudek, and I was checking back to see if his suggestions helped you. If you have found a solution, please share it with the community, as it can be helpful to others.

Prabakar
Esteemed Contributor III

@Direo you can refer to this. The feature is now in public preview.

uzadude
New Contributor III

Hi, the init script doesn't work for me (the workers' PYTHONPATH doesn't get affected), and the suggested options in the above link don't help either.

Is there a way to add another folder to the PYTHONPATH of the workers?

Cintendo
New Contributor III

For the worker nodes, you can set a Spark config in the cluster settings: spark.executorEnv.PYTHONPATH.

However, you need to make sure you append your Workspace path at the end, as the worker nodes still need the other system Python paths.
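
For illustration, the entry in the cluster's Spark config tab could look something like this (both paths below are placeholders, not the actual system paths on your runtime):

spark.executorEnv.PYTHONPATH /databricks/spark/python:/Workspace/Repos/some_user/shared_repo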

This seems like a hack to me. I hope Databricks can respond with a more solid solution.

uzadude
New Contributor III

Setting `spark.executorEnv.PYTHONPATH` did not work for me; it looks like Spark/Databricks overwrites it somewhere. I used a simple Python UDF to print properties like `sys.path` and `os.environ`, and I didn't see the path I added.

Finally, I found a hacky way of using `spark._sc._python_includes`.

You can see my answer to myself here.
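
A rough sketch of that workaround, assuming a made-up repo path (note that `_python_includes` is an internal, undocumented PySpark attribute, so this may change between runtime versions):

import sys

module_path = "/Workspace/Repos/shared/my_project"  # hypothetical path to the shared code

# Make the module importable on the driver
if module_path not in sys.path:
    sys.path.append(module_path)

# _python_includes is the internal list of paths PySpark ships to its Python workers;
# appending an absolute path here gets it onto the workers' sys.path as well.
sc = spark.sparkContext
sc._python_includes.append(module_path)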

Cintendo
New Contributor III

Thanks @Ohad Raviv. I will try your approach.

spark.executorEnv.PYTHONPATH works only for the worker nodes, not the driver node. And it needs to be set at the cluster initialization stage (under the Spark tab); after the cluster is initialized, Databricks overwrites it even if you manually call spark.conf.set.

I prefer setting the environment outside of code, as hard-coding it breaks code integrity and is hard to enforce when multiple people are working on the same cluster. I wish there were a better way in the Databricks cluster screen: either let users append to sys.path after the defaults, or let people do an editable install (pip install -e) during development.

I checked the worker node PYTHONPATH using the following to make sure it gets appended.

def getworkerenv():
    import os
    return os.getenv('PYTHONPATH')

sc = spark.sparkContext
sc.parallelize([1]).map(lambda x: getworkerenv()).collect()

uzadude
New Contributor III

The hacky solution above is meant to be used only while developing my own Python module; this way I can avoid packaging a whl, deploying it to the cluster, restarting the cluster, and even restarting the notebook interpreter.

I agree that it is not suited for production. For that I would use either a whl reference in the workflow file or just prepare a Docker image.

Jfoxyyc
Valued Contributor

To be honest, I'm just inspecting which repo folder I'm running from (dev/test/prod) and sys.path.append-ing an appropriate path before importing my packages. It seems to work, and it's covered by the Terraform provider.
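
For example, something along these lines (the folder layout and repo name are just placeholders for illustration):

import os
import sys

# Hypothetical layout: each environment has its own checkout under /Workspace/Repos/<env>/my_project
for env in ("dev", "test", "prod"):
    candidate = f"/Workspace/Repos/{env}/my_project"
    if os.path.isdir(candidate):
        if candidate not in sys.path:
            sys.path.append(candidate)
        break

# from my_project.utils import helpers  # imports resolve once the path is on sys.path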

uzadude
New Contributor III

The issue with that is that the driver's sys.path is not propagated to the executors' sys.path, so you could get a "module not found" error if code running on the executors tries to import one of your modules.

But it will work fine for simple code that is self-contained.
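
A minimal illustration of that failure mode (the path, module, and function names are made up):

import sys

sys.path.append("/Workspace/Repos/dev/my_project")  # hypothetical path, added on the driver only
import my_project                                   # works on the driver

def apply_transform(x):
    # This import runs on an executor, whose sys.path was never updated,
    # so it can raise ModuleNotFoundError.
    import my_project
    return my_project.transform(x)                  # hypothetical function

# spark.sparkContext.parallelize([1, 2, 3]).map(apply_transform).collect()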

Jfoxyyc
Valued Contributor

I've been successfully using this in Delta Live Tables pipelines with many nodes. It seems to work for my use case.
