Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How do I import a python module when deploying with DAB?

kenny_hero
New Contributor II

Below is the folder structure of my project:

resources/
  |- etl_event/
     |- etl_event.job.yml
src/
  |- pipeline/
     |- etl_event/
        |- transformers/
          |- transformer_1.py
  |- utils/
     |- logger.py
databricks.yml

I created a Python module, logger.py, that I want to reuse in many transformers, so I put it in the utils folder. I want transformer_1.py to import the get_logger() function from that module:

# transformer_1.py
from pyspark import pipelines as dp
from utils.logger import get_logger

@dp.table
def load_events():
    logger = get_logger(__name__)
    logger.info('Loading events...')
    ...
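
For context, logger.py is a small wrapper around the standard logging module, along these lines (a simplified sketch, not the exact file):

# logger.py (simplified sketch)
import logging

def get_logger(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger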

After I deploy my project using DAB, running the "etl_event" job fails because the 'utils' module is not found.

I learned the following from researching the documentation:

  1. PYTHONPATH is not set to the root of the bundle.
  2. One suggestion is to call `sys.path.append()` before each import. This feels like a hack: I would have to remember to manipulate the system path before every import (see the sketch after this list).
  3. Another suggestion is to build a Python wheel file. I'm not sure that applies to my project, because the module (utils/logger.py) is already included in the deployed bundle; I should not need to build a separate wheel file.
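
To make point 2 concrete, this is the kind of boilerplate every transformer would need before its own imports (illustrative only; the deploy path is a placeholder, not my real one):

# Illustrative sys.path workaround -- placeholder path, not a real value
import sys

BUNDLE_SRC = "/Workspace/Users/<me>/.bundle/<project>/dev/files/src"
if BUNDLE_SRC not in sys.path:
    sys.path.append(BUNDLE_SRC)

from utils.logger import get_logger  # only resolvable after the append above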

My question is: what is the proper way to configure my project/bundle so that I can have reusable Python modules available to my pipeline transformers?

 

6 REPLIES

Hubert-Dudek
Databricks MVP

Under src I would add a parent directory for pipeline and utils and put an __init__.py there:

resources/
  |- etl_event/
     |- etl_event.job.yml
src/
  |- project/
     |- __init__.py
     |- pipeline/
        |- etl_event/
           |- transformers/
              |- transformer_1.py
     |- utils/
        |- __init__.py
        |- logger.py
databricks.yml
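
With that layout the import in the transformer goes through the package (assuming src/ ends up on sys.path at runtime):

# transformer_1.py under the restructured layout
from pyspark import pipelines as dp
from project.utils.logger import get_logger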

My blog: https://databrickster.medium.com/

Sanjeeb2024
Contributor III

Thank you @Hubert-Dudek. This is very useful.

Sanjeeb Mohapatra

kenny_hero
New Contributor II

@Hubert-Dudek, thank you for your response. I really appreciate it.

However, I still cannot get the import to work even after following your instructions. Here is the folder structure:

import_test_project_structure.png

The transformer code is below:

from pyspark import pipelines as dp

from project.utils.logger import get_logger
#from utils.logger import get_logger

@dp.table
def sample_users():
    logger = get_logger(__name__)
    logger.info('Running sample_users transformer')
    return (
        spark.read.table("samples.wanderbricks.users")
        .select("user_id", "email", "name", "user_type")
    )

After deploying the bundle, I tried running the job but it still fails to import the logger module.

Traceback (most recent call last):
File "/Workspace/Users/.../.bundle/import_test/dev/files/src/project/pipelines/import_test/transformations/sample_users.py", cell 1, line 3
      1 from pyspark import pipelines as dp
----> 3 from project.utils.logger import get_logger
      4 #from utils.logger import get_logger
      6 @dp.table
      7 def sample_users():

File "/databricks/python_shell/lib/dbruntime/autoreload/discoverability/hook.py", line 71, in AutoreloadDiscoverabilityHook._patched_import(self, name, *args, **kwargs)
     65 if not self._should_hint and (
     66     (module := sys.modules.get(absolute_name)) is not None and
     67     (fname := get_allowed_file_name_or_none(module)) is not None and
     68     (mtime := os.stat(fname).st_mtime) > self.last_mtime_by_modname.get(
     69         absolute_name, float("inf")) and not self._should_hint):
     70     self._should_hint = True
---> 71 module = self._original_builtins_import(name, *args, **kwargs)
     72 if (fname := fname or get_allowed_file_name_or_none(module)) is not None:
     73     mtime = mtime or os.stat(fname).st_mtime

ModuleNotFoundError: No module named 'project'
Error: [update_progress]  Update 803be7 is FAILED.

 I tried importing from both "project.utils.logger" and "utils.logger" and neither worked.

Can you help me figure out what I'm doing wrong?

rcdatabricks

The __init__.py has to be under src/ (i.e., src/__init__.py); it looks like you created it under utils/__init__.py.

kenny_hero
New Contributor II

@rcdatabricks,

I tried that.

Screenshot 2026-01-06 at 10.03.04 AM.png

But import still fails.

ile "/Workspace/Users/.../.bundle/import_test/dev/files/src/project/pipelines/import_test/transformations/sample_users.py", cell 1, line 3
      1 from pyspark import pipelines as dp
----> 3 from project.utils.logger import get_logger
      4 # from utils.logger import get_logger
      6 @dp.table
      7 def sample_users():

File "/databricks/python_shell/lib/dbruntime/autoreload/discoverability/hook.py", line 71, in AutoreloadDiscoverabilityHook._patched_import(self, name, *args, **kwargs)
     65 if not self._should_hint and (
     66     (module := sys.modules.get(absolute_name)) is not None and
     67     (fname := get_allowed_file_name_or_none(module)) is not None and
     68     (mtime := os.stat(fname).st_mtime) > self.last_mtime_by_modname.get(
     69         absolute_name, float("inf")) and not self._should_hint):
     70     self._should_hint = True
---> 71 module = self._original_builtins_import(name, *args, **kwargs)
     72 if (fname := fname or get_allowed_file_name_or_none(module)) is not None:
     73     mtime = mtime or os.stat(fname).st_mtime

ModuleNotFoundError: No module named 'project'
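
In case it helps anyone diagnose this, a quick way to see what the interpreter is actually searching is a couple of prints at the top of the transformer (a debugging sketch, nothing more):

# Debugging sketch: dump the working directory and the import search path
import os
import sys
import pprint

print(os.getcwd())
pprint.pprint(sys.path)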

I'm new to Databricks, so I may not fully understand how bundle deployment works. I hope someone can share knowledge to help me with a problem that feels like it should be trivial.

kenny_hero
New Contributor II

Just to share knowledge with other readers: I think the best option for now (until Databricks supports adding the project root to PYTHONPATH) is to refactor the project so custom modules live in a separate folder, build a wheel when building the DAB, and declare it as a dependency in the pipeline/job config.

I followed the instructions on this page but struggled to make them work. It turned out the dependency config is different if the pipeline runs on serverless. Instead of using "libraries" as in the instructions, I had to use "environment":

resources:
  pipelines:
    <pipeline name>:
      environment:
        dependencies:
          - <path to wheel file>
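
For completeness, the wheel itself is built by declaring it as an artifact in databricks.yml. A rough sketch of how the pieces fit together (names and paths are illustrative, not my exact config):

artifacts:
  my_utils:
    type: whl
    path: ./my_utils          # folder containing the package's pyproject.toml or setup.py
    build: python -m build --wheel

resources:
  pipelines:
    <pipeline name>:
      environment:
        dependencies:
          - <path to wheel file built above>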

I wish the instructions included this example for serverless, or that Databricks supported the same configuration regardless of compute type.