Below is what the folder structure of my project looks like:
    resources/
    |- etl_event/
    |  |- etl_event.job.yml
    src/
    |- pipeline/
    |  |- etl_event/
    |  |  |- transformers/
    |  |  |  |- transformer_1.py
    |  |  |- utils/
    |  |  |  |- logger.py
    databricks.yml
I created a Python module, logger.py, that I want to reuse across many transformers, so I put it in the utils folder. I want transformer_1.py to import the get_logger() function from that module:
    # transformer_1.py
    from pyspark import pipelines as dp

    from utils.logger import get_logger


    @dp.table
    def load_events():
        logger = get_logger(__name__)
        logger.info('Loading events...')
        ...
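For reference, logger.py is just a thin wrapper around the standard logging module. A simplified sketch is below; the handler and format details are illustrative, only the get_logger() signature matters for this question:

    # logger.py (simplified; only get_logger() matters here)
    import logging


    def get_logger(name):
        """Return a logger with a basic stream handler attached once."""
        logger = logging.getLogger(name)
        if not logger.handlers:
            handler = logging.StreamHandler()
            handler.setFormatter(
                logging.Formatter('%(asctime)s %(name)s %(levelname)s: %(message)s')
            )
            logger.addHandler(handler)
            logger.setLevel(logging.INFO)
        return logger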
After I deploy the project using DAB (Databricks Asset Bundles), running the "etl_event" job fails because the 'utils' module is not found.
From researching the documentation, I learned the following:
- PYTHONPATH is not set to the root of the bundle.
- One suggestion is to call `sys.path.append()` before the import in each file (sketched below, after this list). This feels like a hack: I have to remember to manipulate the system path in every transformer before importing.
- Another suggestion is to build a Python wheel file. I'm not sure that applies to my project, because the module (utils/logger.py) is already included in the deployed bundle; I should not need to build and attach a separate wheel.
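For completeness, this is the `sys.path.append()` workaround I'd like to avoid. It's a rough sketch, assuming utils/ sits next to transformers/ as in the tree above and that `__file__` is available in the deployed source file:

    # transformer_1.py -- path-manipulation workaround (the hack I want to avoid)
    import os
    import sys

    # etl_event/ (the parent of transformers/) contains utils/, so put it on the search path.
    sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))

    from utils.logger import get_logger  # noqa: E402 -- import must come after the path tweak

Multiply this by every transformer file and it becomes easy to forget, which is why I'm looking for a bundle-level solution.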
My question: what is the proper way to configure my project/bundle so that reusable Python modules like this can be imported from pipeline transformers?