11-08-2023 04:17 AM
We have structured our code according to the documentation (notebooks-best-practices): we use Jupyter notebooks and have moved shared logic into Python modules. Unfortunately, the example described in the documentation only works if the code is checked out as a repository in Databricks, because only then is the Python root path set correctly (the output of the "sys.path" Python variable contains the path to the repository root). Since we now deploy our code with Databricks Asset Bundles, it is not checked out as a repository but ends up as workspace files. There, apparently only the path of the current folder is added (the output of "sys.path" only contains the path to the current folder). As a result, it is not possible to load modules that live in other folders.
Folder structure:
- my_project
  - notebooks
    - my_notebook.ipynb
  - my_modules
    - my_module.py
In the notebook I want to import functions from my module:
from my_modules.my_module import *
This fails with a ModuleNotFoundError.
Output of:
import sys
sys.path
When I check out the example code from the docs, it contains both the current folder path and the path of the repository root:
['/databricks/python_shell/scripts', '/local_disk0/spark-64d2e6e2-e358-49fb-ad1b-c95f90915d9e/userFiles-7afa9a58-9505-40ec-94e9-a073787631d8', '/databricks/spark/python', '/databricks/spark/python/lib/py4j-0.10.9.7-src.zip', '/databricks/jars/spark--driver--driver-spark_3.5_2.12_deploy.jar', '/Workspace/Repos/<username>/notebook-best-practices/notebooks', '/Workspace/Repos/<username>/notebook-best-practices', '/databricks/python_shell', '/usr/lib/python310.zip', '/usr/lib/python3.10', '/usr/lib/python3.10/lib-dynload', '', '/local_disk0/.ephemeral_nfs/envs/pythonEnv-1f4b77f0-6bd3-4d36-a1cc-693215915cd6/lib/python3.10/site-packages', '/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages', '/databricks/python/lib/python3.10/site-packages', '/usr/local/lib/python3.10/dist-packages', '/usr/lib/python3/dist-packages']
For my own code deployed as workspace files, it only contains the path of the current folder, but not the root path:
['/databricks/python_shell/scripts', '/local_disk0/spark-64d2e6e2-e358-49fb-ad1b-c95f90915d9e/userFiles-7afa9a58-9505-40ec-94e9-a073787631d8', '/databricks/spark/python', '/databricks/spark/python/lib/py4j-0.10.9.7-src.zip', '/databricks/jars/spark--driver--driver-spark_3.5_2.12_deploy.jar', '/databricks/python_shell', '/usr/lib/python310.zip', '/usr/lib/python3.10', '/usr/lib/python3.10/lib-dynload', '/local_disk0/.ephemeral_nfs/envs/pythonEnv-c070c35e-ee85-452e-a03e-fe117d43a515/lib/python3.10/site-packages', '/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages', '/databricks/python/lib/python3.10/site-packages', '/usr/local/lib/python3.10/dist-packages', '/usr/lib/python3/dist-packages', '', '/Workspace/Users/<username>/modular_test/notebooks']
How do I set the root path of my code when deploying it with Databricks Asset Bundles?
Is there a better way than manipulating the sys.path variable? The following code works, but we have to copy it into each notebook that imports modules.
import os
import sys

notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
sys.path.append("/Workspace" + os.path.dirname(os.path.dirname(os.path.dirname(notebook_path))))
11-16-2023 05:31 AM - edited 11-16-2023 05:36 AM
Hi @RobinK, I asked a similar question over here: https://community.databricks.com/t5/warehousing-analytics/import-python-files-as-modules-in-workspac.... I found this resource that gives some good examples of how to import .py modules from outside of your current working directory: https://github.com/databricks-academy/cli-demo/blob/published/notebooks/00_refactoring_to_relative_i.... It involves appending the root directory from which you want to import to sys.path; a rough sketch of the pattern is below. This feels a little hacky (and something you typically want to avoid in normal Python development), but it does seem to be the recommended approach. This is basically what Databricks does for you when you are working within a Repo.
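For illustration, here is a minimal sketch of that pattern, assuming the notebook sits in a subfolder (e.g. notebooks/) directly under the project root and the modules live in a sibling folder like my_modules (names are illustrative, not taken from the linked repo):

import os
import sys

# Resolve the project root one level up from the notebook's working directory.
project_root = os.path.abspath("..")
if project_root not in sys.path:
    sys.path.append(project_root)

# Imports from sibling folders now resolve against the project root.
from my_modules.my_module import *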
11-16-2023 06:37 AM
Hello Robin,
You’ll have to either use wheel files to package your libs and install those (see the docs), which makes imports work out of the box; a rough sketch of that option is included below. Otherwise, your entry point file needs to add the bundle root directory (or whatever the lib directory is) to your sys.path.
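For reference, a rough sketch of what the wheel option could look like in databricks.yml (the artifact name, package name, entry point, and build command are placeholders, not taken from your project):

artifacts:
  default:
    type: whl
    build: python -m build --wheel
    path: .

resources:
  jobs:
    my_job:
      tasks:
        - task_key: stream
          python_wheel_task:
            package_name: my_project
            entry_point: main
          libraries:
            # install the wheel built by the artifacts section above
            - whl: ./dist/*.whl
          job_cluster_key: Job_cluster

With this setup the package is installed on the cluster like any other library, so notebooks and scripts can import it without touching sys.path.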
The sys.path approach can be achieved by adding a task parameter and then processing that parameter in the first couple of lines of your entry point file (as defined in the bundle).
For example, if your bundle task looks like so:
- task_key: stream
  notebook_task:
    notebook_path: /path/to/file.ipynb
    base_parameters:
      bundle_root: ${workspace.file_path}
  job_cluster_key: Job_cluster
note that "bundle_root" is arbitrary, it can be whatever you want it to be called. ${workspace.file_path} will substitute this value at runtime so it will work in different workspaces seemlessly.
Now in your /path/to/file.ipynb file, you would do something like:
import sys

root = dbutils.widgets.get("bundle_root")
sys.path.append(root)
Hope this helps!
11-21-2023 05:45 AM
Hi Corbin,
thank you for the reply. We can work with your solution, but we still have to copy the same parameter-reading code into every notebook, and it only works when the code is deployed with DABs. For local development, or for debugging the DAB-deployed code directly in Databricks, we would need yet another workaround, because the parameters (for example "bundle_root") are only set within the job.
Are there any plans to add a "bundle_root" parameter to DABs that is put on the "sys.path" variable automatically, like it is for Databricks Repos?
11-22-2023 06:58 PM
Hi @RobinK, I've been struggling with the same thing. I've experimented with the following, and it looks like it does what you want.
1. Create a new notebook (set_notebook_paths.py) just for setting the paths, using the "bundle_root" method described above. My actual notebooks are two levels deep from the root, hence the "../..".
# Databricks notebook source
import sys
import os

try:
    root = dbutils.widgets.get("bundle_root")
    if root:
        print("bundle_root: " + root)
    else:
        root = os.path.abspath('../..')
except:
    print("bundle_root not defined. Using relative path.")
    root = os.path.abspath('../..')

sys.path.append(root)
print(sys.path)
2. Use the MAGIC %run in your main notebook to run that notebook. That magically makes it work when running as a Job, but is ignored for local development.
# Databricks notebook source
# MAGIC %run ../set_notebook_paths
# COMMAND ----------
from lib.blah import some_function
It's not as nice as "Files in Repos", which just works without all this path manipulation 😞
Hope that helps, and if you find a better way please let me know!
12-13-2023 04:40 AM
Thanks! I had to set bundle_root as /Workspace${workspace.file_path} to get my import to work properly. Is this expected?
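For reference, with the task definition from Corbin's example, that would make the parameter look like this (same placeholder names as above):

base_parameters:
  bundle_root: /Workspace${workspace.file_path}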