Community Platform Discussions

When is it time to change from ETL in notebooks to whl/py?

Forssen
New Contributor II

Hi!
I would like some input/tips from the community on when it is time to move from a working solution in notebooks to something more "stable", like whl/py files.

What are the pros and cons of notebooks compared to whl/py?

The way I have structured things now is that I use notebooks as an orchestrator. The code is built as modules in py files and simply imported into the notebook. Everything the ETL needs comes from a config file (YAML or JSON), so nothing is hardcoded.
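
For illustration, a minimal sketch of that kind of setup could look like the following. The step functions, the STEPS registry, and the config layout are all hypothetical and not the actual project code; plain lists of dicts stand in for DataFrames so the example runs anywhere.

```python
import json

# Hypothetical step functions. In the setup described above these would
# live in importable .py modules rather than in the notebook itself.
def drop_nulls(rows, column):
    """Keep only rows where `column` is present and not None."""
    return [r for r in rows if r.get(column) is not None]

def rename_column(rows, old, new):
    """Return copies of the rows with key `old` renamed to `new`."""
    return [{(new if k == old else k): v for k, v in r.items()} for r in rows]

# Registry mapping step names used in the config to functions (assumed layout).
STEPS = {"drop_nulls": drop_nulls, "rename_column": rename_column}

def run_pipeline(rows, config):
    """Apply the steps listed in the config, in order."""
    for step in config["steps"]:
        func = STEPS[step["name"]]
        rows = func(rows, **step.get("args", {}))
    return rows

# Inline JSON stands in for the external config file.
CONFIG_JSON = """
{"steps": [
  {"name": "drop_nulls", "args": {"column": "id"}},
  {"name": "rename_column", "args": {"old": "id", "new": "customer_id"}}
]}
"""

config = json.loads(CONFIG_JSON)
rows = [{"id": 1, "v": "a"}, {"id": None, "v": "b"}]
result = run_pipeline(rows, config)
```

With this shape, the orchestrating notebook only loads the config and calls run_pipeline, which is also what a packaged entry point would do.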

Thanks in advance 🙂

1 ACCEPTED SOLUTION

Isi
Contributor

Hey @Forssen ,

My advice:

Using .py files and .whl packages is generally more secure and scalable, especially when working in a team. One of the key advantages is that code reviews and version control are much more efficient with .py files, as changes can be properly tracked via pull requests.

While notebooks can have permissions set for reading and version control, they are often harder to manage in collaborative environments. A common issue is that people forget to remove unnecessary display() or collect() calls, which make reviewing and debugging easier in a notebook but are considered bad practice in production. In addition, a single "," accidentally inserted into a notebook can make your production job fail.
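
Once the code lives in .py files, leftovers like these can be caught automatically. Below is a rough sketch of such a check using Python's standard ast module. The set of names treated as debug-only (DEBUG_CALLS) and the function name find_debug_calls are assumptions for illustration, not an existing tool.

```python
import ast

# Names treated as "debug-only" here are an assumption for illustration.
DEBUG_CALLS = {"display", "collect"}

def find_debug_calls(source):
    """Return sorted (line, name) pairs for each call to a name in DEBUG_CALLS."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            # Plain call such as display(df), or method call such as df.collect()
            if isinstance(func, ast.Name):
                name = func.id
            else:
                name = getattr(func, "attr", None)
            if name in DEBUG_CALLS:
                hits.append((node.lineno, name))
    return sorted(hits)

sample = "df = load()\ndisplay(df)\nrows = df.collect()\n"
hits = find_debug_calls(sample)
```

A check like this could run in a pre-commit hook or CI step. Note that ast.parse raises SyntaxError for a file that no longer parses, so a stray comma that breaks the syntax would be caught as well; a stray comma that is still valid Python (for example, one that silently turns a value into a tuple) would not be.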

Advantages of .py and .whl over notebooks:

Better version control & code reviews (easier to track changes and enforce coding standards).
Better modularization & reusability (separating logic into reusable components).
Easier CI/CD integration (you can automate testing, packaging, and deployment).
More structured and maintainable codebase (better organization and scalability).
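
As a small illustration of the testing point, logic that lives in a plain function can be tested without a cluster or a notebook. The function add_full_name and its test are hypothetical examples; plain dicts stand in for DataFrame rows.

```python
def add_full_name(rows):
    """Return new rows with a 'full_name' field built from 'first' and 'last'."""
    return [{**r, "full_name": f"{r['first']} {r['last']}"} for r in rows]

def test_add_full_name():
    rows = [{"first": "Ada", "last": "Lovelace"}]
    expected = [{"first": "Ada", "last": "Lovelace", "full_name": "Ada Lovelace"}]
    assert add_full_name(rows) == expected
    # The input list is left unchanged, which keeps the function easy to reuse.
    assert rows == [{"first": "Ada", "last": "Lovelace"}]
```

A test runner such as pytest would pick up test_add_full_name in a CI pipeline, so a change that breaks the function fails the build before it reaches a production job.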

Disadvantages:

Harder debugging compared to notebooks (notebooks allow quick testing and visualization).
Steeper learning curve for new users who are used to interactive workflows.

Given your current setup, where you use notebooks only as orchestrators and keep your logic in .py modules, you already have a good balance. The next step could be fully transitioning orchestration to workflows (like Airflow or Databricks Jobs) and packaging your code into .whl files for better maintainability.
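
As a rough sketch of what replaces the orchestrating notebook: a packaged job usually exposes an entry-point function that reads its parameters from the command line, which is how a Databricks Python wheel task is generally expected to pass them. The argument names (--config, --env) and the job_name config key below are hypothetical.

```python
import argparse
import json

def parse_args(argv=None):
    """Parse job parameters; argv=None means 'read from sys.argv'."""
    parser = argparse.ArgumentParser(description="Run the ETL job (illustrative).")
    parser.add_argument("--config", required=True, help="Path to a JSON config file")
    parser.add_argument("--env", default="dev", choices=["dev", "prod"])
    return parser.parse_args(argv)

def main(argv=None):
    """Entry point: load the config, then hand over to the packaged modules."""
    args = parse_args(argv)
    with open(args.config) as f:
        config = json.load(f)
    # The calls into your own ETL modules would go here.
    print(f"Running {config.get('job_name', 'unnamed')} in {args.env}")
    return config
```

The function main would then be registered as an entry point in the package metadata, and the job would be configured with the package name, the entry point name, and the parameter list.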

🙂

Forssen
New Contributor II

Hi!
Thanks for the reply and information!
I think I might keep some parts as notebooks, but only in workflows, since workflow variables can't be set any other way 😕
