Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to set Python rootpath when deploying with DABs

RobinK
Contributor

We have structured our code according to the documentation (notebooks-best-practices). We use Jupyter notebooks and have moved the logic into Python modules. Unfortunately, the example in the documentation only works if the code is checked out as a repository in Databricks, because only then is the Python root path set correctly (sys.path contains the path to the repository root). Since we now deploy our code with Databricks Asset Bundles, it is not checked out as a repository but ends up in the workspace. There, apparently only the path of the current folder is added (sys.path contains the path to the current folder, but not the root). As a result, modules in other folders cannot be imported.

Folder structure

- my_project
-- notebooks
--- my_notebook.ipynb
-- my_modules
--- my_module.py

In the notebook I want to import functions of my module: 

from my_modules.my_module import *
But an error occurs: 
ModuleNotFoundError

Output of 

import sys
sys.path

When I check out the example code from the docs (it contains the current folder path and the path of the root folder):

['/databricks/python_shell/scripts',
 '/local_disk0/spark-64d2e6e2-e358-49fb-ad1b-c95f90915d9e/userFiles-7afa9a58-9505-40ec-94e9-a073787631d8',
 '/databricks/spark/python',
 '/databricks/spark/python/lib/py4j-0.10.9.7-src.zip',
 '/databricks/jars/spark--driver--driver-spark_3.5_2.12_deploy.jar',
 '/Workspace/Repos/<username>/notebook-best-practices/notebooks',
 '/Workspace/Repos/<username>/notebook-best-practices',
 '/databricks/python_shell',
 '/usr/lib/python310.zip',
 '/usr/lib/python3.10',
 '/usr/lib/python3.10/lib-dynload',
 '',
 '/local_disk0/.ephemeral_nfs/envs/pythonEnv-1f4b77f0-6bd3-4d36-a1cc-693215915cd6/lib/python3.10/site-packages',
 '/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages',
 '/databricks/python/lib/python3.10/site-packages',
 '/usr/local/lib/python3.10/dist-packages',
 '/usr/lib/python3/dist-packages']

Of my own code (it only contains the path of the current folder, but not the root path):

['/databricks/python_shell/scripts',
 '/local_disk0/spark-64d2e6e2-e358-49fb-ad1b-c95f90915d9e/userFiles-7afa9a58-9505-40ec-94e9-a073787631d8',
 '/databricks/spark/python',
 '/databricks/spark/python/lib/py4j-0.10.9.7-src.zip',
 '/databricks/jars/spark--driver--driver-spark_3.5_2.12_deploy.jar',
 '/databricks/python_shell',
 '/usr/lib/python310.zip',
 '/usr/lib/python3.10',
 '/usr/lib/python3.10/lib-dynload',
 '/local_disk0/.ephemeral_nfs/envs/pythonEnv-c070c35e-ee85-452e-a03e-fe117d43a515/lib/python3.10/site-packages',
 '/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages',
 '/databricks/python/lib/python3.10/site-packages',
 '/usr/local/lib/python3.10/dist-packages',
 '/usr/lib/python3/dist-packages',
 '',
 '/Workspace/Users/<username>/modular_test/notebooks']

How do I set the root path of my code when deploying it with Databricks Asset Bundles?

Is there a better way than manipulating the sys.path variable? This code works, but we have to copy it into each notebook that imports modules.

import os
import sys
notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
sys.path.append("/Workspace" + os.path.dirname(os.path.dirname(os.path.dirname(notebook_path))))
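
The path arithmetic in the snippet above can be pulled into a small helper so the number of directory levels is explicit. This is only a sketch: the function name `project_root_from_notebook` and the `levels_up` and `prefix` parameters are made up for illustration, and it assumes the notebook path is reported without the "/Workspace" prefix, as in the snippet above.

```python
import posixpath


def project_root_from_notebook(notebook_path: str,
                               levels_up: int = 2,
                               prefix: str = "/Workspace") -> str:
    """Return the directory `levels_up` levels above the notebook, with `prefix` prepended.

    Hypothetical helper: workspace paths always use forward slashes,
    so posixpath is used instead of os.path.
    """
    root = notebook_path
    for _ in range(levels_up):
        # Each dirname() call strips one trailing path component.
        root = posixpath.dirname(root)
    return prefix + root
```

In a notebook this would be called with the path obtained from dbutils, followed by sys.path.append(...) on the result. How many levels to go up depends on how deeply the notebook is nested below the folder that contains the modules.
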
1 ACCEPTED SOLUTION

Accepted Solutions

Corbin
New Contributor III

Hello Robin,

You’ll have to either package your libs as wheel files and use those (see docs here) to make imports work out of the box, or have your entry point file add the bundle root directory (or whatever the lib directory is) to sys.path. The latter can be achieved by adding a parameter and then processing it in the first couple of lines of your entry point file (as defined in the bundle).

For example, if your bundle task looks like so:

- task_key: stream
  notebook_task:
    notebook_path: /path/to/file.ipynb
    base_parameters:
      bundle_root: ${workspace.file_path}
  job_cluster_key: Job_cluster

Note that the name "bundle_root" is arbitrary; you can call it whatever you want. ${workspace.file_path} is replaced with the actual path, so this works across different workspaces seamlessly.

Now in your /path/to/file.ipynb file, you would do something like:

import sys
root = dbutils.widgets.get("bundle_root")
sys.path.append(root)

 Hope this helps!
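
One detail worth guarding against: if the cell that appends to sys.path is re-run, the same entry gets appended again. A minimal sketch of an idempotent version follows; the helper name `add_to_sys_path` is hypothetical and not part of any Databricks API.

```python
import sys


def add_to_sys_path(root: str) -> bool:
    """Append `root` to sys.path unless it is empty or already present.

    Returns True if the path was added, False otherwise.
    """
    if not root or root in sys.path:
        return False
    sys.path.append(root)
    return True
```

In the entry point notebook this would be called as add_to_sys_path(dbutils.widgets.get("bundle_root")).
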


5 REPLIES

TimReddick
Contributor

Hi @RobinK, I asked a similar question over here: https://community.databricks.com/t5/warehousing-analytics/import-python-files-as-modules-in-workspac.... I found this resource that gives some good examples of how to import .py modules from outside of your current working directory: https://github.com/databricks-academy/cli-demo/blob/published/notebooks/00_refactoring_to_relative_i.... It involves appending your root directory from which you want to import to sys.path. This feels a little hacky (and something I think you typically want to avoid in normal Python development), but it does seem to be the recommended approach. This is basically what Databricks is doing for you when you are working within a Repo.

Hi Corbin,

thank you for the reply. We can work with your solution, but we still have to copy the same parameter-reading code into every notebook, and it only works when the code runs as a job deployed with DABs. For local development, or for debugging the deployed DABs code in Databricks, we would need another workaround, because the parameters (for example "bundle_root") are only set within the job.

Are there any plans for DABs to provide a "bundle_root" setting that is added to sys.path automatically, as in Databricks Repos?

 

JeremyFord
New Contributor III

Hi @RobinK I've been struggling with the same. I've experimented with the following and it looks like it does what you want. 

1. Create a new Notebook (set_notebook_paths.py) just for setting the paths as above using the "bundle_root" method. My actual notebooks are two levels deep from the root hence the "../..".

 

# Databricks notebook source

import os
import sys

try:
    # Set by the job via base_parameters; missing when run interactively.
    root = dbutils.widgets.get("bundle_root")
except Exception:
    root = ""

if root:
    print("bundle_root: " + root)
else:
    print("bundle_root not defined. Using relative path.")
    # My notebooks are two levels below the project root.
    root = os.path.abspath("../..")

if root not in sys.path:
    sys.path.append(root)
print(sys.path)

 

2. Use the %run magic in your main notebook to run that notebook. That makes it work when running as a job, while it is ignored for local development.

 

# Databricks notebook source

# MAGIC %run ../set_notebook_paths

# COMMAND ----------

from lib.blah import some_function

 

It's not as nice as "Files in Repos", which just works without all this path manipulation 😞

Hope that helps, and if you find a better way please let me know!
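
A possible variation that avoids both the job parameter and the hard-coded "../..": walk upwards from the current directory until a folder containing the module directory is found. This is only a sketch under assumptions: `find_project_root` is a hypothetical helper, the marker name (for example "my_modules" or "lib") is whatever your importable folder is called, and it assumes the working directory of the notebook is the notebook's own folder.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start, marker: str) -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains a folder named `marker`.

    Returns None if no such directory exists up to the filesystem root.
    """
    current = Path(start).resolve()
    for candidate in (current, *current.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None
```

Usage would look like root = find_project_root(os.getcwd(), "my_modules"), followed by sys.path.append(str(root)) if root is not None. The same code then runs unchanged as a job and interactively, at the cost of relying on the folder name.
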

Thanks! I had to set bundle_root as /Workspace${workspace.file_path} to get my import to work properly. Is this expected?
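
Since the reply above had to add the "/Workspace" prefix by hand, a defensive option is to normalize the value in the notebook so that both forms are accepted. A minimal sketch; `normalize_workspace_path` is a hypothetical helper, not a Databricks function.

```python
def normalize_workspace_path(path: str, prefix: str = "/Workspace") -> str:
    """Return `path` with `prefix` in front, unless it is already there."""
    if path == prefix or path.startswith(prefix + "/"):
        return path
    # lstrip avoids a double slash when `path` is absolute.
    return prefix + "/" + path.lstrip("/")
```

The entry point would then call sys.path.append(normalize_workspace_path(dbutils.widgets.get("bundle_root"))), and the bundle definition could pass ${workspace.file_path} with or without the prefix.
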

