I’m curious about data engineering best practices for a large-scale project using Databricks to build a Lakehouse architecture (Bronze -> Silver -> Gold layers).
I’m currently comparing two ways of structuring the code and want to know which, if either, is considered the better approach:
- “Scripted” approach (see the first sketch below):
- Each notebook contains all operations, including common ones
- Minimal use of functions, no classes
- All code written out in each notebook for easy debugging
- Table attributes are declared as standalone variables following common naming conventions, with no encapsulation (e.g., table1_silver_tablename, table1_processed_df, table2_silver_tablename, table2_processed_df, etc.)
- “Modular” approach (see the second sketch below):
- Common operations (e.g., environment setup, incremental reads, standard transformations, schema checks, Delta merges) stored in a shared codebase
- Use of classes for encapsulating table attributes and operations
- Custom transformations specific to each source kept separate
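To make the comparison concrete, here are two simplified, hypothetical sketches of what I mean (the table names, paths, keys, and the SilverTable class are made up for illustration):

```python
# "Scripted" style: everything inline in the notebook, one set of standalone
# variables per table; the same block is copy/pasted and renamed for table2, table3, ...
# (`spark` is the SparkSession that Databricks provides in the notebook)
from pyspark.sql import functions as F

table1_bronze_path = "/mnt/bronze/source1/table1"        # placeholder path
table1_silver_tablename = "silver.table1"                # placeholder table name

table1_raw_df = spark.read.format("delta").load(table1_bronze_path)

table1_processed_df = (
    table1_raw_df
    .withColumn("file_name", F.input_file_name())         # file name parsing
    .dropDuplicates(["business_key"])                      # deduplication
    .withColumn("update_date", F.current_timestamp())      # audit column
)

table1_processed_df.write.format("delta").mode("append").saveAsTable(table1_silver_tablename)
```

```python
# "Modular" style: shared logic lives in a common module, and each table is an
# instance of a class that encapsulates its attributes and the standard operations
# (class and method names are hypothetical, just to show the structure)
from pyspark.sql import DataFrame, SparkSession, functions as F


class SilverTable:
    def __init__(self, spark: SparkSession, bronze_path: str, silver_table: str, keys: list):
        self.spark = spark
        self.bronze_path = bronze_path
        self.silver_table = silver_table
        self.keys = keys

    def read_incremental(self) -> DataFrame:
        # shared incremental-read / checkpointing logic would live here
        return self.spark.read.format("delta").load(self.bronze_path)

    def standard_transform(self, df: DataFrame) -> DataFrame:
        return (
            df.withColumn("file_name", F.input_file_name())
              .dropDuplicates(self.keys)
              .withColumn("update_date", F.current_timestamp())
        )

    def custom_transform(self, df: DataFrame) -> DataFrame:
        # overridden per source where a custom transformation is needed
        return df

    def write(self, df: DataFrame) -> None:
        df.write.format("delta").mode("append").saveAsTable(self.silver_table)

    def run(self) -> None:
        self.write(self.custom_transform(self.standard_transform(self.read_incremental())))
```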
Both approaches handle the same tasks, including:
- Environment variable management
- Incremental source reading
- Standard transformations (e.g., file name parsing, deduplication)
- Schema validation
- Delta merging with insert/update date management (sketched after this list)
- Checkpointing and metadata management
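For example, the Delta merge with insert/update date management is essentially the following in both versions; only where the code lives differs (target table, key, and audit column names are placeholders):

```python
# Hypothetical sketch of the Delta merge step with insert/update date management
from delta.tables import DeltaTable
from pyspark.sql import functions as F

updates_df = table1_processed_df                        # the transformed incremental batch
target = DeltaTable.forName(spark, "silver.table1")     # placeholder target table

(
    target.alias("t")
    .merge(updates_df.alias("s"), "t.business_key = s.business_key")
    .whenMatchedUpdate(set={
        "value": "s.value",
        "update_date": F.current_timestamp(),
    })
    .whenNotMatchedInsert(values={
        "business_key": "s.business_key",
        "value": "s.value",
        "insert_date": F.current_timestamp(),
        "update_date": F.current_timestamp(),
    })
    .execute()
)
```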
However, “Modular” keeps these operations in a separate module (or notebook) that the primary notebooks call via an import, the %run magic command, or dbutils.notebook.run, whereas “Scripted” rewrites them in each notebook so that, for simplicity, everything stays self-contained.
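In the “Modular” version, a primary notebook then contains only the source-specific pieces, roughly like this (reusing the hypothetical SilverTable class from the sketch above; the module path is made up):

```python
# Primary notebook in the "Modular" approach: shared code is pulled in, and only
# table-specific configuration and custom transformations live here
from common.lakehouse import SilverTable   # hypothetical shared module
# (alternatively: %run ./shared/common_lakehouse in its own cell, or dbutils.notebook.run)

table1 = SilverTable(
    spark,
    bronze_path="/mnt/bronze/source1/table1",
    silver_table="silver.table1",
    keys=["business_key"],
)
table1.run()
```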
Question:
What is the industry best practice for data engineering on large-scale projects in Databricks: a scripted approach for simplicity, or a modular approach for long-term sustainability? Is there a clear favorite?
Please provide references to established best practices or official documentation, if such exist. Thank you!