Friday
Below is what the folder structure of my project looks like:

resources/
|- etl_event/
|  |- etl_event.job.yml
src/
|- pipeline/
|  |- etl_event/
|  |  |- transformers/
|  |  |  |- transformer_1.py
|- utils/
|  |- logger.py
databricks.yml

I created a Python module, logger.py, that I want to reuse in many transformers, so I put it in the utils folder. I want transformer_1.py to import the get_logger() function from that module:
# transformer_1.py
from pyspark import pipelines as dp
from utils.logger import get_logger

@dp.table
def load_events():
    logger = get_logger(__name__)
    logger.info('Loading events...')
    ...

After I deploy my project using DAB, running the "etl_event" job fails because the 'utils' module is not found.
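For context, logger.py is essentially a thin wrapper around Python's standard logging module, along these lines (a simplified sketch; the real file may differ):

# logger.py (simplified sketch; actual contents may differ)
import logging

def get_logger(name):
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter('%(asctime)s %(name)s %(levelname)s %(message)s'))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger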
I did some research in the documentation but couldn't find a definitive answer. My question is: what is the proper way to configure my project/bundle so that I can have reusable Python modules that my pipeline transformers can import?
Saturday
Under src I would add a parent directory for pipeline and utils, and put an __init__.py in it:
resources/
|- etl_event/
|  |- etl_event.job.yml
src/
|- project/
|  |- __init__.py
|  |- pipeline/
|  |  |- etl_event/
|  |  |  |- transformers/
|  |  |  |  |- transformer_1.py
|  |- utils/
|  |  |- __init__.py
|  |  |- logger.py
databricks.yml
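The transformer can then import through the package root; a minimal sketch of what transformer_1.py would look like with this layout:

# transformer_1.py
from pyspark import pipelines as dp
from project.utils.logger import get_logger  # import via the 'project' package root

@dp.table
def load_events():
    logger = get_logger(__name__)
    logger.info('Loading events...')
    ...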
Saturday
Thank you @Hubert-Dudek. This is very useful.
Monday
@Hubert-Dudek, thank you for your response. I really appreciate it.
However, I still cannot get the import to work even after following your instructions. I set up the folder structure as you suggested; the transformer code is below:
from pyspark import pipelines as dp
from project.utils.logger import get_logger
# from utils.logger import get_logger

@dp.table
def sample_users():
    logger = get_logger(__name__)
    logger.info('Running sample_users transformer')
    return (
        spark.read.table("samples.wanderbricks.users")
        .select("user_id", "email", "name", "user_type")
    )

After deploying the bundle, I tried running the job, but it still fails to import the logger module:
Traceback (most recent call last):
File "/Workspace/Users/.../.bundle/import_test/dev/files/src/project/pipelines/import_test/transformations/sample_users.py", cell 1, line 3
1 from pyspark import pipelines as dp
----> 3 from project.utils.logger import get_logger
4 #from utils.logger import get_logger
6 @dp.table
7 def sample_users():
File "/databricks/python_shell/lib/dbruntime/autoreload/discoverability/hook.py", line 71, in AutoreloadDiscoverabilityHook._patched_import(self, name, *args, **kwargs)
65 if not self._should_hint and (
66 (module := sys.modules.get(absolute_name)) is not None and
67 (fname := get_allowed_file_name_or_none(module)) is not None and
68 (mtime := os.stat(fname).st_mtime) > self.last_mtime_by_modname.get(
69 absolute_name, float("inf")) and not self._should_hint):
70 self._should_hint = True
---> 71 module = self._original_builtins_import(name, *args, **kwargs)
72 if (fname := fname or get_allowed_file_name_or_none(module)) is not None:
73 mtime = mtime or os.stat(fname).st_mtime
ModuleNotFoundError: No module named 'project'
Error: [update_progress] Update 803be7 is FAILED.

I tried importing from both "project.utils.logger" and "utils.logger", and neither worked.
Can you help me figure out what I'm doing wrong?
Tuesday
The __init__.py has to be directly under src, i.e. src/__init__.py. It looks like you created it under utils/__init__.py instead.
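That is, something like:

src/
|- __init__.py
|- project/
|  |- pipelines/
|  |  |- ...
|  |- utils/
|  |  |- __init__.py
|  |  |- logger.py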
Tuesday
I tried that, but the import still fails:
ile "/Workspace/Users/.../.bundle/import_test/dev/files/src/project/pipelines/import_test/transformations/sample_users.py", cell 1, line 3
1 from pyspark import pipelines as dp
----> 3 from project.utils.logger import get_logger
4 # from utils.logger import get_logger
6 @dp.table
7 def sample_users():
File "/databricks/python_shell/lib/dbruntime/autoreload/discoverability/hook.py", line 71, in AutoreloadDiscoverabilityHook._patched_import(self, name, *args, **kwargs)
65 if not self._should_hint and (
66 (module := sys.modules.get(absolute_name)) is not None and
67 (fname := get_allowed_file_name_or_none(module)) is not None and
68 (mtime := os.stat(fname).st_mtime) > self.last_mtime_by_modname.get(
69 absolute_name, float("inf")) and not self._should_hint):
70 self._should_hint = True
---> 71 module = self._original_builtins_import(name, *args, **kwargs)
72 if (fname := fname or get_allowed_file_name_or_none(module)) is not None:
73 mtime = mtime or os.stat(fname).st_mtime
ModuleNotFoundError: No module named 'project'

I'm new to Databricks, so I may not fully understand how bundle deployment works. I hope someone can share knowledge to help me with a problem that feels like it should be trivial.