How do I import a python module when deploying with DAB?

kenny_hero
New Contributor II

Below is the folder structure of my project:

resources/
  |- etl_event/
     |- etl_event.job.yml
src/
  |- pipeline/
     |- etl_event/
        |- transformers/
          |- transformer_1.py
  |- utils/
     |- logger.py
databricks.yml

I created a Python module, logger.py, that I want to reuse in many transformers, so I put it in the utils folder. I want transformer_1.py to import the get_logger() function from that module:

# transformer_1.py
from pyspark import pipelines as dp
from utils.logger import get_logger

@dp.table
def load_events():
    logger = get_logger(__name__)
    logger.info('Loading events...')
    ...
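A minimal version of logger.py, for illustration (simplified sketch; the real file may differ):

# logger.py -- simplified illustration of the helper module
import logging

def get_logger(name: str) -> logging.Logger:
    """Return a named logger with a basic stream handler attached."""
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s"))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger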

After I deploy my project using DAB, running the "etl_event" job fails because the 'utils' module is not found.

I learned the following from researching the documentation:

  1. PYTHONPATH is not set to the root of the bundle.
  2. One suggestion is to call `sys.path.append()` before the import. This feels like a hack that I must remember to apply before every import (a sketch of this workaround follows this list).
  3. Another suggestion is to build a Python wheel file. I'm not sure that applies to my project, because the module (utils/logger.py) is already included in the deployed bundle; I should not need to build a separate wheel file.
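For illustration, the `sys.path` workaround from point 2 would look roughly like this inside transformer_1.py (a sketch only; the relative depth from the file up to the bundle's src/ directory is an assumption and may differ in the deployed layout):

# transformer_1.py -- sketch of the sys.path workaround (assumes src/ sits
# three directories above this file; the depth may differ after deployment)
import os
import sys

src_root = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", "..", ".."))
if src_root not in sys.path:
    sys.path.append(src_root)

from utils.logger import get_logger  # resolvable only after the path tweak above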

My question is: what is the proper way to configure my project/bundle so that I can have reusable Python modules that pipeline transformers can import?

 

5 REPLIES

Hubert-Dudek
Databricks MVP

Under src I would add a parent directory for pipeline and utils, and put an __init__.py in it:

resources/
  |- etl_event/
     |- etl_event.job.yml
src/
  |- project/
     |- __init__.py
     |- pipeline/
        |- etl_event/
           |- transformers/
              |- transformer_1.py
     |- utils/
        |- __init__.py
        |- logger.py
databricks.yml
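With that layout, the transformer can import the helper through the package (assuming src/ ends up on the interpreter's path at runtime):

# transformer_1.py -- package-qualified import (assumes src/ is on sys.path)
from project.utils.logger import get_logger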

My blog: https://databrickster.medium.com/

Sanjeeb2024
Contributor III

Thank you @Hubert-Dudek. This is very useful.

Sanjeeb Mohapatra

kenny_hero
New Contributor II

@Hubert-Dudek, thank you for your response. I really appreciate it.

However, I still cannot get the import to work, even after following your instructions. Here is the folder structure:

import_test_project_structure.png

The transformer code is below:

from pyspark import pipelines as dp

from project.utils.logger import get_logger
#from utils.logger import get_logger

@dp.table
def sample_users():
    logger = get_logger(__name__)
    logger.info('Running sample_users transformer')
    return (
        spark.read.table("samples.wanderbricks.users")
        .select("user_id", "email", "name", "user_type")
    )

After deploying the bundle, I tried running the job but it still fails to import the logger module.

Traceback (most recent call last):
File "/Workspace/Users/.../.bundle/import_test/dev/files/src/project/pipelines/import_test/transformations/sample_users.py", cell 1, line 3
      1 from pyspark import pipelines as dp
----> 3 from project.utils.logger import get_logger
      4 #from utils.logger import get_logger
      6 @dp.table
      7 def sample_users():

File "/databricks/python_shell/lib/dbruntime/autoreload/discoverability/hook.py", line 71, in AutoreloadDiscoverabilityHook._patched_import(self, name, *args, **kwargs)
     65 if not self._should_hint and (
     66     (module := sys.modules.get(absolute_name)) is not None and
     67     (fname := get_allowed_file_name_or_none(module)) is not None and
     68     (mtime := os.stat(fname).st_mtime) > self.last_mtime_by_modname.get(
     69         absolute_name, float("inf")) and not self._should_hint):
     70     self._should_hint = True
---> 71 module = self._original_builtins_import(name, *args, **kwargs)
     72 if (fname := fname or get_allowed_file_name_or_none(module)) is not None:
     73     mtime = mtime or os.stat(fname).st_mtime

ModuleNotFoundError: No module named 'project'
Error: [update_progress]  Update 803be7 is FAILED.

 I tried importing from both "project.utils.logger" and "utils.logger" and neither worked.

Can you help me figure out what I'm doing wrong?

rcdatabricks

The __init__.py has to be directly under src/ (i.e., src/__init__.py). It looks like you created it under utils/__init__.py instead.

@rcdatabricks,

I tried that.

Screenshot 2026-01-06 at 10.03.04 AM.png

But import still fails.

File "/Workspace/Users/.../.bundle/import_test/dev/files/src/project/pipelines/import_test/transformations/sample_users.py", cell 1, line 3
      1 from pyspark import pipelines as dp
----> 3 from project.utils.logger import get_logger
      4 # from utils.logger import get_logger
      6 @dp.table
      7 def sample_users():

File "/databricks/python_shell/lib/dbruntime/autoreload/discoverability/hook.py", line 71, in AutoreloadDiscoverabilityHook._patched_import(self, name, *args, **kwargs)
     65 if not self._should_hint and (
     66     (module := sys.modules.get(absolute_name)) is not None and
     67     (fname := get_allowed_file_name_or_none(module)) is not None and
     68     (mtime := os.stat(fname).st_mtime) > self.last_mtime_by_modname.get(
     69         absolute_name, float("inf")) and not self._should_hint):
     70     self._should_hint = True
---> 71 module = self._original_builtins_import(name, *args, **kwargs)
     72 if (fname := fname or get_allowed_file_name_or_none(module)) is not None:
     73     mtime = mtime or os.stat(fname).st_mtime

ModuleNotFoundError: No module named 'project'

I'm new to Databricks, so I may not fully understand how bundle deployment works. I hope someone can share some knowledge to help me with a problem that feels like it should be trivial.