3 weeks ago
Below is the folder structure of my project:
resources/
|- etl_event/
   |- etl_event.job.yml
src/
|- pipeline/
   |- etl_event/
      |- transformers/
         |- transformer_1.py
|- utils/
   |- logger.py
databricks.yml

I created a Python module, logger.py, that I want to reuse in many transformers, so I put it in the utils folder. I want transformer_1.py to import the get_logger() function from that module:
# transformer_1.py
from pyspark import pipelines as dp
from utils.logger import get_logger

@dp.table
def load_events():
    logger = get_logger(__name__)
    logger.info('Loading events...')
    ...

After I deploy my project using DAB, running the "etl_event" job fails because the 'utils' module is not found.
I learned the following from researching the documentation:
My question is: what is the proper way to configure my project/bundle so that I can have reusable Python modules that pipeline transformers can import?
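For completeness, the contents of logger.py are not shown above; it is just a small helper around Python's standard logging module, roughly along these lines (an illustrative sketch, not the exact file):

# logger.py (illustrative sketch; the actual module is not shown in this post)
import logging
import sys

def get_logger(name: str) -> logging.Logger:
    # Return a named logger that writes to stdout so messages
    # show up in the pipeline's driver logs.
    logger = logging.getLogger(name)
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter('%(asctime)s %(name)s %(levelname)s %(message)s'))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger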
3 weeks ago
Under src I would add a parent directory for pipeline and utils and put an __init__.py there:
resources/
|- etl_event/
   |- etl_event.job.yml
src/
|- project/
   |- __init__.py
   |- pipeline/
      |- etl_event/
         |- transformers/
            |- transformer_1.py
   |- utils/
      |- __init__.py
      |- logger.py
databricks.yml
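The intent of this layout is that transformers import through the package name instead of the bare utils folder, for example (assuming the package directory is named project as above):

# transformer_1.py
from pyspark import pipelines as dp
from project.utils.logger import get_logger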
3 weeks ago
Thank you @Hubert-Dudek. This is very useful.
3 weeks ago
@Hubert-Dudek, thank you for your response. I really appreciate it.
However, I still cannot get the import to work, even after following your instructions. Here is the folder structure:
The transformer code is below:
from pyspark import pipelines as dp
from project.utils.logger import get_logger
# from utils.logger import get_logger

@dp.table
def sample_users():
    logger = get_logger(__name__)
    logger.info('Running sample_users transformer')
    return (
        spark.read.table("samples.wanderbricks.users")
        .select("user_id", "email", "name", "user_type")
    )

After deploying the bundle, I tried running the job, but it still fails to import the logger module.
Traceback (most recent call last):
File "/Workspace/Users/.../.bundle/import_test/dev/files/src/project/pipelines/import_test/transformations/sample_users.py", cell 1, line 3
1 from pyspark import pipelines as dp
----> 3 from project.utils.logger import get_logger
4 #from utils.logger import get_logger
6 @dp.table
7 def sample_users():
File "/databricks/python_shell/lib/dbruntime/autoreload/discoverability/hook.py", line 71, in AutoreloadDiscoverabilityHook._patched_import(self, name, *args, **kwargs)
65 if not self._should_hint and (
66 (module := sys.modules.get(absolute_name)) is not None and
67 (fname := get_allowed_file_name_or_none(module)) is not None and
68 (mtime := os.stat(fname).st_mtime) > self.last_mtime_by_modname.get(
69 absolute_name, float("inf")) and not self._should_hint):
70 self._should_hint = True
---> 71 module = self._original_builtins_import(name, *args, **kwargs)
72 if (fname := fname or get_allowed_file_name_or_none(module)) is not None:
73 mtime = mtime or os.stat(fname).st_mtime
ModuleNotFoundError: No module named 'project'
Error: [update_progress] Update 803be7 is FAILED.

I tried importing from both "project.utils.logger" and "utils.logger", and neither worked.
Can you help me figure out what I'm doing wrong?
2 weeks ago
The __init__.py has to be under src/ (i.e., src/__init__.py). It looks like you created it under utils/__init__.py instead.
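If I read this correctly, the suggested placement looks roughly like this (illustrative only, mirroring the layout from the earlier reply):

src/
|- __init__.py
|- project/
   |- __init__.py
   |- pipeline/
   |- utils/
      |- __init__.py
      |- logger.py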
2 weeks ago
I tried that, but the import still fails.
File "/Workspace/Users/.../.bundle/import_test/dev/files/src/project/pipelines/import_test/transformations/sample_users.py", cell 1, line 3
1 from pyspark import pipelines as dp
----> 3 from project.utils.logger import get_logger
4 # from utils.logger import get_logger
6 @dp.table
7 def sample_users():
File "/databricks/python_shell/lib/dbruntime/autoreload/discoverability/hook.py", line 71, in AutoreloadDiscoverabilityHook._patched_import(self, name, *args, **kwargs)
65 if not self._should_hint and (
66 (module := sys.modules.get(absolute_name)) is not None and
67 (fname := get_allowed_file_name_or_none(module)) is not None and
68 (mtime := os.stat(fname).st_mtime) > self.last_mtime_by_modname.get(
69 absolute_name, float("inf")) and not self._should_hint):
70 self._should_hint = True
---> 71 module = self._original_builtins_import(name, *args, **kwargs)
72 if (fname := fname or get_allowed_file_name_or_none(module)) is not None:
73 mtime = mtime or os.stat(fname).st_mtime
ModuleNotFoundError: No module named 'project'

I'm new to Databricks, so I may not fully understand how bundle deployment works. I hope someone can share knowledge to help me with a problem that feels like it should be trivial.
yesterday
Just to share knowledge with other readers: I think the best option for now (until Databricks supports adding the project root to PYTHONPATH) is to refactor the project so that the reusable modules live in a separate folder, build them into a wheel when building the DAB, and declare the wheel as a dependency in the pipeline/job config.
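For anyone looking for a starting point, a minimal sketch of the wheel-building side in databricks.yml could look like the following (the artifact name, build command, and path are assumptions for illustration; they depend on how your package is set up):

# databricks.yml (sketch)
artifacts:
  my_utils_wheel:
    type: whl
    build: python -m build --wheel   # assumes a setuptools/pyproject-based package
    path: ./src/my_utils             # hypothetical folder containing the reusable modules

The built wheel can then be referenced from the pipeline or job configuration, as shown below for serverless.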
I followed the instructions on this page, but struggled to make it work. It turned out that the dependency configuration is different when the pipeline runs on serverless compute. Instead of using "libraries" as shown in the instructions, I had to use "environment":
resources:
  pipelines:
    <pipeline name>:
      environment:
        dependencies:
          - <path to wheel file>

I wish the instructions included this example for serverless, or that Databricks supported the same configuration regardless of compute type.