Hello Community,
We are currently building a system in Databricks where multiple tasks are combined into a single job that produces final output data.
So far, our approach is based on Python notebooks (with asset bundles) that orchestrate the workflow. Each notebook calls functions from separate Python modules responsible for smaller processing steps. We can unit test the Python modules without issues, but testing the notebook logic itself is challenging. At the moment, the only way to validate the full flow is to run everything directly in Databricks.
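To make the setup concrete, here is a simplified sketch of the layout. All names are hypothetical, and plain Python lists stand in for our real data: small step functions live in a module and are unit tested, while the notebook only chains them together.

```python
# Hypothetical stand-ins for our processing modules.
# The real steps operate on our actual data; lists of dicts keep the sketch self-contained.

def clean_rows(rows):
    """Drop records that have no 'id' (a small step, unit tested in isolation)."""
    return [r for r in rows if r.get("id") is not None]


def add_total(rows):
    """Add a 'total' field computed from 'price' and 'qty'."""
    return [{**r, "total": r["price"] * r["qty"]} for r in rows]


def run_pipeline(rows):
    """The chaining logic that currently lives in a notebook cell."""
    return add_total(clean_rows(rows))


if __name__ == "__main__":
    sample = [
        {"id": 1, "price": 2.0, "qty": 3},
        {"id": None, "price": 1.0, "qty": 1},
    ]
    print(run_pipeline(sample))
```

The step functions are easy to test. It is the chaining (`run_pipeline` in this sketch) that we can currently only exercise by running the notebook in Databricks.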
Because of this limitation, we are considering replacing notebooks with pure Python files. Before making this change, I have a few questions:
1. How can variables be passed between tasks when using pure Python files?
   I'm familiar with passing variables between notebook tasks, but I'm unsure how this would work with Python scripts.
2. What is the recommended approach for writing end-to-end (E2E) integration tests for a Databricks job consisting of multiple tasks?
3. What is the general recommendation: notebooks or pure Python files?
   Regardless of the option, what are the main benefits and trade-offs of each approach?
I would appreciate any insights or best practices based on your experience.
Thank you!