Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to make .py files available for import?

JooseSauli
New Contributor II

Hello,

I've looked around, but cannot find an answer. In my Azure Databricks workspace, users have Python notebooks which all make use of the same helper functions and classes. Instead of housing the helper code in notebooks and pulling it in with %run magics, I want to organize my helper code in .py files so that users can import them as modules.

This would work if I had, say, foo.py in the same folder where my notebook is (then I could import foo), or in a subfolder (then I could from subfolder import foo). But the notebooks are arranged in different folders, and foo.py is somewhere else, so it will not always be in the notebook's sys.path.

What can I do to make foo.py available for import in any notebook in any folder?

I know I could import sys and sys.path.append(<path to foo>) at the start of every notebook, but I don't want to do that either, as I'm trying to make things simple for people writing notebooks.
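Just to be concrete, the boilerplate I'm trying to spare everyone from looks roughly like this (assuming foo.py sits in /Workspace/Shared/foo/, as it does in my workspace):

```python
# The workaround I'd rather not put at the top of every notebook:
import sys

shared_dir = "/Workspace/Shared/foo"  # folder that contains foo.py
if shared_dir not in sys.path:
    sys.path.append(shared_dir)

import foo  # only resolvable because the folder above is now on sys.path
```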

I tried using a cluster init script which would 1) create a folder for foo.py under /databricks/python_scripts/, which happens to be in my sys.path, and 2) copy foo.py there from /Workspace/Shared/foo/. The script would create the folder all right, but could not copy Workspace files there. (I tried placing foo.py in Azure Data Lake Storage and having the init script copy it from there. That worked, but clearly it's not a good solution.)

Perhaps I could package foo.py as a library and install it on the cluster, but that seems pretty convoluted compared to simply putting foo.py somewhere the import statement will find it. How can I accomplish that?

Thanks,

JS

3 REPLIES

Brahmareddy
Honored Contributor III

Hi JooseSauli,

How are you doing today? Totally get where you're coming from; it's very common to want to keep shared helper code clean and reusable without making every notebook author mess with sys.path.

The smoothest way to do this in Databricks is to package your helper code as a Python wheel (.whl) and install it as a custom library on the cluster, either through the UI or with a Databricks Asset Bundle if you're deploying. That way the module is available in every notebook, regardless of folder structure, and users can simply import foo without worrying about paths or %run. It may feel like extra setup at first, but once the code is packaged, updating and maintaining it becomes much easier.

If you're not ready for packaging yet, another option is to store the .py files in DBFS (for example /dbfs/python_libs/foo.py) and add that path once in a cluster init script so it is on sys.path globally at startup. That avoids manual changes in every notebook. Let me know if you want a quick guide for either setup, happy to help make it easy for your team!
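If you do go the wheel route, the packaging side is smaller than it sounds. Here is a rough sketch of a minimal setup.py (the package name foo_helpers and the layout are just placeholders, adjust them to your project):

```python
# setup.py -- minimal sketch for building the shared helpers into a wheel.
# Assumes a placeholder layout like:
#   foo_helpers/
#       __init__.py
#       foo.py
from setuptools import setup, find_packages

setup(
    name="foo_helpers",        # placeholder project name
    version="0.1.0",
    packages=find_packages(),  # picks up the foo_helpers/ package directory
)
```

Build it with python -m build (after pip install build) or your preferred build tool, upload the resulting .whl under the cluster's Libraries tab, and every notebook can do from foo_helpers import foo with no sys.path tricks.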

Regards,

Brahma

JooseSauli
New Contributor II

Hi Brahmareddy,

Thanks for your reply. Your second approach is quite close to what I already tried earlier. Your post got me to do some more testing, and although I don't know how to set the sys.path via the init script (it says here and here that it's not possible), ultimately I found a way that works well enough for me.

Earlier I tried to keep foo.py in /databricks/python_scripts/, but that dir would be erased every time the cluster went down. Then I tried having the init script copy foo.py from /Workspace/Shared/, but that dir is not mounted at the time the init script runs. Your post got me to test whether /dbfs/ persists through shutdowns, and it does. So now I keep foo.py in /dbfs/foo/, and the init script copies it to /databricks/python_scripts/.

This approach has the benefit of letting me keep parallel versions of my helper modules (a production version and a dev version); when I want to run tests, I can import the dev version by manipulating sys.path (other users never need to do this). I doubt this would be possible with the package-installation approach, would it?
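For the record, the dev switch is just a couple of lines at the top of a test notebook. The dev path below is made up for illustration; only the /databricks/python_scripts/ part matches my actual setup:

```python
# Illustrative only: point a test notebook at a dev copy of the helpers.
# /dbfs/foo_dev/ is a hypothetical location for the dev version; the prod
# copy that the init script puts in /databricks/python_scripts/ is untouched.
import sys

dev_dir = "/dbfs/foo_dev"
if dev_dir not in sys.path:
    sys.path.insert(0, dev_dir)  # dev dir takes precedence over the prod copy

import foo  # resolves to the dev foo.py in this notebook only
```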

BR,

JS

Brahmareddy
Honored Contributor III

Hi JooseSauli,

How are you doing today? It sounds like you've landed on a pretty solid and flexible setup! You're absolutely right that /dbfs/ is persistent across cluster restarts, which makes it a good home for shared helper modules like foo.py, with an init script copying the file into /databricks/python_scripts/ where it becomes importable.

I also like how you're using that to maintain both prod and dev versions. Being able to switch between them by adjusting sys.path for testing is very handy, and it would be much harder to pull off cleanly with the packaged .whl approach. That route is better for production stability and version control, but not as flexible for quick iteration or parallel testing like you're doing.

So if your current setup is working well and giving you the dev/prod isolation you need, I'd say you've got a great balance going. Well done! Let me know if you ever want help packaging it later if the project grows.

Regards,

Brahma
