When is it time to change from ETL in notebooks to whl/py?
3 weeks ago
Hi!
I would like some input and tips from the community: when is it time to move from a working solution in notebooks to something more "stable", like whl/py files?
What are the pros and cons of notebooks compared to whl/py?
The way I have structured things now, I use notebooks as an orchestrator. The code is built as modules in py files and simply imported into the notebook. Everything the ETL needs comes from a config file (YAML or JSON), so nothing is hardcoded.
Thanks in advance 🙂
3 weeks ago
Hey @Forsen ,
My advice:
Using .py files and .whl packages is generally more secure and scalable, especially when working in a team. One of the key advantages is that code reviews and version control are much more efficient with .py files, as changes can be properly tracked via pull requests.
While notebooks support read permissions and version control, they are often harder to manage in collaborative environments. A common issue is that people forget to remove unnecessary display() or collect() calls. These calls make reviewing and debugging easier in a notebook, but leaving them in is considered bad practice in production. In addition, a single stray "," typed into the notebook by accident can make your production job fail.
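Both problems can be caught automatically once the code lives in .py files. Here is a minimal sketch of such a check, using only the standard library `ast` module. The function name `find_debug_calls` and the list of flagged calls are my own assumptions, not something Databricks provides.

```python
import ast

# Call names treated as debugging leftovers. This list is an assumption;
# adjust it to whatever your team considers unwanted in production code.
FLAGGED_CALLS = {"display", "collect"}


def find_debug_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, call name) for each flagged call in the source.

    ast.parse raises SyntaxError if the source does not parse, so a stray
    character that breaks the syntax is caught here too.
    """
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            # Plain call such as display(df), or method call such as df.collect()
            if isinstance(func, ast.Name):
                name = func.id
            else:
                name = getattr(func, "attr", None)
            if name in FLAGGED_CALLS:
                hits.append((node.lineno, name))
    return sorted(hits)


sample = "df = load()\ndisplay(df)\nrows = df.collect()\n"
print(find_debug_calls(sample))  # [(2, 'display'), (3, 'collect')]
```

A check like this could run in a pull request pipeline. It only matches on call names, so it will also flag a legitimate collect(); treat the result as a prompt for the reviewer rather than a hard rule.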
Advantages of .py and .whl over notebooks:
• Better version control & code reviews (easier to track changes and enforce coding standards).
• Better modularization & reusability (separating logic into reusable components).
• Easier CI/CD integration (you can automate testing, packaging, and deployment).
• More structured and maintainable codebase (better organization and scalability).
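To illustrate the testing point: once a transformation is a plain function in a module, it can be tested without a notebook or a cluster. This is only a sketch. `normalize_country` is a made-up transformation, and it works on plain dicts instead of Spark DataFrames so that it runs anywhere.

```python
def normalize_country(rows):
    """Trim and upper-case the 'country' field of each row (hypothetical step)."""
    return [{**row, "country": row["country"].strip().upper()} for row in rows]


def test_normalize_country():
    rows = [{"id": 1, "country": " se "}]
    assert normalize_country(rows) == [{"id": 1, "country": "SE"}]


# A test runner such as pytest would discover the test_ function by name;
# calling it directly works as well.
test_normalize_country()
```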
Disadvantages:
• Harder debugging compared to notebooks (notebooks allow quick testing and visualization).
• Steeper learning curve for new users who are used to interactive workflows.
Given your current setup, where you use notebooks only as orchestrators and keep your logic in .py modules, you already have a good balance. The next step could be moving orchestration fully to a workflow tool (like Airflow or Databricks Jobs) and packaging your code into .whl files for better maintainability.
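As a rough sketch of what that could look like, the orchestration notebook can be replaced by a small entry-point function inside the package that reads the config file and runs the ETL. The names `run_etl` and `main` and the config keys `source` and `target` are placeholders I made up; your modules and config will look different.

```python
import argparse
import json


def run_etl(config: dict) -> str:
    """Stand-in for the real ETL modules; here it only reports what it would do."""
    return f"loaded {config['source']} into {config['target']}"


def main(argv=None) -> int:
    """Parse the command line, load the JSON config, and run the ETL."""
    parser = argparse.ArgumentParser(description="Run the ETL from a config file")
    parser.add_argument("--config", required=True, help="Path to a JSON config file")
    args = parser.parse_args(argv)

    with open(args.config, encoding="utf-8") as fh:
        config = json.load(fh)

    print(run_etl(config))
    return 0
```

When the package is built into a .whl, `main` can be declared as an entry point so that the job scheduler calls it with `--config <path>` as parameters. The same function stays callable from a notebook, e.g. `main(["--config", "etl.json"])`, which keeps interactive debugging possible.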
🙂