Best practices for code organization in large-scale Databricks ETL projects: Modular vs. Scripted
11-17-2024 02:01 AM
I'm curious about best practices for a large-scale data engineering project using Databricks to build a Lakehouse architecture (Bronze -> Silver -> Gold layers).
I'm presently comparing two approaches to writing the code and want to know which, if any, is considered the better approach:
- "Scripted" approach:
- Each notebook contains all operations, including common ones
- Minimal use of functions, no classes
- All code written out in each notebook for easy debugging
- Table attributes declared as standalone variables following common naming conventions, but with no encapsulation (e.g., table1_silver_tablename, table1_processed_df, table2_silver_tablename, table2_processed_df, etc.)
- "Modular" approach:
- Common operations (e.g., environment setup, incremental reads, standard transformations, schema checks, Delta merges) stored in a shared codebase
- Use of classes for encapsulating table attributes and operations
- Custom transformations specific to each source kept separate
Both approaches handle the same tasks, including:
- Environment variable management
- Incremental source reading
- Standard transformations (e.g., file name parsing, deduplication)
- Schema validation
- Delta merging with insert/update date management
- Checkpointing and metadata management
However, the modular approach creates a separate module (or notebook) that primary notebooks call via an import, a %run magic command, or dbutils.notebook.run; whereas the scripted approach rewrites these operations in every notebook so that, for simplicity, everything stays self-contained (see the sketch below).
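For illustration, here is a minimal sketch of what the modular variant could look like. Everything in it is hypothetical (the common_etl module, TableConfig, SilverLoader, the paths and table names), and the incremental-read, schema-validation, and checkpointing details are deliberately omitted:

```python
# Shared module, e.g. src/common_etl.py (file, class, and column names are illustrative)
from dataclasses import dataclass

from delta.tables import DeltaTable
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


@dataclass
class TableConfig:
    """Encapsulates the per-table attributes the scripted style keeps as standalone variables."""
    source_path: str
    silver_table: str
    merge_keys: list


class SilverLoader:
    """Common operations shared by every source table."""

    def __init__(self, spark: SparkSession, cfg: TableConfig):
        self.spark = spark
        self.cfg = cfg

    def read_source(self) -> DataFrame:
        # Plain batch read shown for brevity; incremental reads and checkpointing
        # (e.g., Auto Loader) would be encapsulated here as well.
        return self.spark.read.format("parquet").load(self.cfg.source_path)

    def standard_transforms(self, df: DataFrame) -> DataFrame:
        # Shared transformations: capture the source file name, deduplicate on the merge keys.
        return (
            df.withColumn("source_file", F.input_file_name())
              .dropDuplicates(self.cfg.merge_keys)
        )

    def merge_to_silver(self, df: DataFrame) -> None:
        # Simplified Delta merge; real insert/update date management would set those
        # columns explicitly in the whenMatched / whenNotMatched clauses.
        target = DeltaTable.forName(self.spark, self.cfg.silver_table)
        condition = " AND ".join(f"t.{k} = s.{k}" for k in self.cfg.merge_keys)
        (
            target.alias("t")
            .merge(df.alias("s"), condition)
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )


# A per-source notebook then becomes a thin "client":
#
#   from common_etl import TableConfig, SilverLoader
#
#   cfg = TableConfig(source_path="/mnt/bronze/table1",   # hypothetical path
#                     silver_table="silver.table1",       # hypothetical table
#                     merge_keys=["id"])
#   loader = SilverLoader(spark, cfg)
#   loader.merge_to_silver(loader.standard_transforms(loader.read_source()))
```

In the scripted approach, the body of each of those methods would instead be copied into every source's notebook with table-specific variable names.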
Question:
What is the industry best practice for data engineering on large-scale projects in Databricks: a scripted approach for simplicity, or a modular approach for long-term sustainability? Is there a clear favorite?
Please provide references to established best practices or official documentation if such exist. Thank you!
11-17-2024 09:53 AM - edited 11-17-2024 09:54 AM
Hi @ashap551 ,
I would vote for the modular approach, which lets you reuse code and write unit tests in a simpler manner. For me, notebooks are only "clients" of these shared modules. You can take a look at the official documentation, where they follow a similar approach:
Software engineering best practices for notebooks | Databricks on AWS
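To make that concrete, the pattern is roughly the one sketched below: shared logic lives in a plain .py file in the repo, notebooks import it, and pytest can exercise it without a notebook at all. The file and function names here are placeholders, not taken from the doc:

```python
# transforms.py -- shared logic kept in a plain Python file in the repo (names are placeholders)
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def add_ingest_date(df: DataFrame, col_name: str = "ingest_date") -> DataFrame:
    """Small, pure function over a DataFrame: easy to import from a notebook and to unit test."""
    return df.withColumn(col_name, F.current_date())


# test_transforms.py -- pytest-style unit test; assumes a conftest.py provides a local
# SparkSession fixture named `spark` (not shown), and in the real test file you would
# start with: from transforms import add_ingest_date
def test_add_ingest_date(spark):
    df = spark.createDataFrame([(1,), (2,)], ["id"])
    out = add_ingest_date(df)
    assert "ingest_date" in out.columns
    assert out.count() == 2
```

In a notebook the same function is pulled in with a plain import (or %run if you keep it as a notebook), which is what makes the "notebooks as clients" idea practical.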
11-17-2024 10:48 AM
Thank you, szymon_dybczak. I agree that it is software best practice, and the documentation substantiates it.
I'm just wondering whether newer data engineering practices are starting to move away from functional and modular styles toward self-contained notebooks. A few of my colleagues find it very difficult to follow a modular coding style and strongly prefer to code in place in a single script / single notebook. More traditional data engineers, who are used to modular code, tend not to like rewriting code and prefer it the way you recommend here.
Wasn't sure if you've come across the same pattern.
Just trying to keep up with industry trends myself!

