I’m curious about data engineering best practices for a large-scale project using Databricks to build a Lakehouse architecture (Bronze -> Silver -> Gold layers).
I’m currently comparing two ways of structuring the code and want to know which, if either, is considered the better approach:
- “Scripted” approach (see the first sketch below):
- Each notebook contains all operations, including common ones
- Minimal use of functions, no classes
- All code written out in each notebook for easy debugging
- Table attributes are declared as standalone variables following common naming conventions, with no encapsulation (e.g., table1_silver_tablename, table1_processed_df, table2_silver_tablename, table2_processed_df, etc.)
- “Modular” approach (see the second sketch below):
- Common operations (e.g., environment setup, incremental reads, standard transformations, schema checks, Delta merges) stored in a shared codebase
- Use of classes for encapsulating table attributes and operations
- Custom transformations specific to each source kept separate
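To make the comparison concrete, here are two simplified, hypothetical sketches of what I mean (the table names, paths, keys, and the SilverTable class are made up for illustration):

```python
# "Scripted" style: everything inline in the notebook, one set of standalone
# variables per table; the same block is copy/pasted and renamed for table2, table3, ...
# (`spark` is the SparkSession that Databricks provides in the notebook)
from pyspark.sql import functions as F

table1_bronze_path = "/mnt/bronze/source1/table1"        # placeholder path
table1_silver_tablename = "silver.table1"                # placeholder table name

table1_raw_df = spark.read.format("delta").load(table1_bronze_path)

table1_processed_df = (
    table1_raw_df
    .withColumn("file_name", F.input_file_name())         # file name parsing
    .dropDuplicates(["business_key"])                      # deduplication
    .withColumn("update_date", F.current_timestamp())      # audit column
)

table1_processed_df.write.format("delta").mode("append").saveAsTable(table1_silver_tablename)
```

```python
# "Modular" style: shared logic lives in a common module, and each table is an
# instance of a class that encapsulates its attributes and the standard operations
# (class and method names are hypothetical, just to show the structure)
from pyspark.sql import DataFrame, SparkSession, functions as F


class SilverTable:
    def __init__(self, spark: SparkSession, bronze_path: str, silver_table: str, keys: list):
        self.spark = spark
        self.bronze_path = bronze_path
        self.silver_table = silver_table
        self.keys = keys

    def read_incremental(self) -> DataFrame:
        # shared incremental-read / checkpointing logic would live here
        return self.spark.read.format("delta").load(self.bronze_path)

    def standard_transform(self, df: DataFrame) -> DataFrame:
        return (
            df.withColumn("file_name", F.input_file_name())
              .dropDuplicates(self.keys)
              .withColumn("update_date", F.current_timestamp())
        )

    def custom_transform(self, df: DataFrame) -> DataFrame:
        # overridden per source where a custom transformation is needed
        return df

    def write(self, df: DataFrame) -> None:
        df.write.format("delta").mode("append").saveAsTable(self.silver_table)

    def run(self) -> None:
        self.write(self.custom_transform(self.standard_transform(self.read_incremental())))
```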
Both approaches handle the same tasks, including:
- Environment variable management
- Incremental source reading
- Standard transformations (e.g., file name parsing, deduplication)
- Schema validation
- Delta merging with insert/update date management (sketched after this list)
- Checkpointing and metadata management
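For example, the Delta merge with insert/update date management is essentially the following in both versions; only where the code lives differs (target table, key, and audit column names are placeholders):

```python
# Hypothetical sketch of the Delta merge step with insert/update date management
from delta.tables import DeltaTable
from pyspark.sql import functions as F

updates_df = table1_processed_df                        # the transformed incremental batch
target = DeltaTable.forName(spark, "silver.table1")     # placeholder target table

(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.business_key = s.business_key")
    .whenMatchedUpdate(set={
        "value": "s.value",
        "update_date": F.current_timestamp(),
    })
    .whenNotMatchedInsert(values={
        "business_key": "s.business_key",
        "value": "s.value",
        "insert_date": F.current_timestamp(),
        "update_date": F.current_timestamp(),
    })
    .execute()
)
```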
However, “Modular” keeps these operations in a separate module (or notebook) that the primary notebooks call via an import, the %run magic command, or dbutils.notebook.run, whereas “Scripted” rewrites them in each notebook so that, for simplicity, everything stays self-contained.
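In the “Modular” version, a primary notebook then contains only the source-specific pieces, roughly like this (reusing the hypothetical SilverTable class from the sketch above; the module path is made up):

```python
# Primary notebook in the "Modular" approach: shared code is pulled in, and only
# table-specific configuration and custom transformations live here
from common.lakehouse import SilverTable   # hypothetical shared module
# (alternatively: %run ./shared/common_lakehouse in its own cell, or dbutils.notebook.run)

table1 = SilverTable(
    spark,
    bronze_path="/mnt/bronze/source1/table1",
    silver_table="silver.table1",
    keys=["business_key"],
)
table1.run()
```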
Question:
What is the industry best practice for data engineering on large-scale projects in Databricks: a scripted approach for simplicity, or a modular approach for long-term sustainability? Is there a clear favorite?
Please provide references to established best practices or official documentation, if such exist. Thank you!