Best practices for code organization in large-scale Databricks ETL projects: Modular vs. Scripted

ashap551
New Contributor II

I'm curious about data engineering best practices for a large-scale project using Databricks to build a Lakehouse architecture (Bronze -> Silver -> Gold layers).

I'm currently comparing two approaches to writing the code and want to know which, if either, is considered the best approach:

  1. "Scripted" approach:
  • Each notebook contains all operations, including common ones
  • Minimal use of functions, no classes
  • All code written out in each notebook for easy debugging
  • Table attributes declared as standalone variables using common naming conventions, but with no encapsulation (e.g., table1_silver_tablename, table1_processed_df, table2_silver_tablename, table2_processed_df, etc.)
  2. "Modular" approach:
  • Common operations (e.g., environment setup, incremental reads, standard transformations, schema checks, Delta merges) stored in a shared codebase
  • Classes used to encapsulate table attributes and operations
  • Custom transformations specific to each source kept separate

Both approaches handle the same tasks, including:

  • Environment variable management
  • Incremental source reading
  • Standard transformations (e.g., file name parsing, deduplication)
  • Schema validation
  • Delta merging with insert/update date management
  • Checkpointing and metadata management

However, the "Modular" approach keeps these common operations in a separate module (or notebook) that the primary notebooks call via an import, a %run magic command, or dbutils.notebook.run; the "Scripted" approach rewrites them in each notebook so that, for simplicity, everything stays self-contained inside its own notebook.
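To make the comparison concrete, here is a minimal sketch of what the modular split might look like. The module name (etl_common), the class (SilverTableConfig), the function names, the ingest_date column, and the paths are all hypothetical, purely for illustration; assume the module lives in a Repo or workspace files folder so notebooks can import it.

```python
# etl_common.py -- hypothetical shared module that primary notebooks import
from dataclasses import dataclass
from typing import List

from delta.tables import DeltaTable
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


@dataclass
class SilverTableConfig:
    """Encapsulates the per-table attributes that the scripted approach keeps
    as standalone variables (table1_silver_tablename, table1_processed_df, ...)."""
    source_path: str      # e.g. a Bronze landing location (illustrative)
    target_table: str     # e.g. "silver.table1" (illustrative)
    merge_keys: List[str]


def read_incremental(spark: SparkSession, cfg: SilverTableConfig, since: str) -> DataFrame:
    # Simplified incremental read filtered on an assumed ingest_date column;
    # a real pipeline might use Auto Loader or Delta Change Data Feed instead.
    return (spark.read.format("delta")
            .load(cfg.source_path)
            .where(F.col("ingest_date") > F.lit(since)))


def standard_transformations(df: DataFrame) -> DataFrame:
    # Shared transformations: capture the source file name and deduplicate.
    return (df.withColumn("source_file", F.input_file_name())
              .dropDuplicates())


def merge_to_silver(spark: SparkSession, df: DataFrame, cfg: SilverTableConfig) -> None:
    # Delta merge with update-timestamp management, shared by every table.
    df = df.withColumn("updated_at", F.current_timestamp())
    target = DeltaTable.forName(spark, cfg.target_table)
    condition = " AND ".join(f"t.{k} = s.{k}" for k in cfg.merge_keys)
    (target.alias("t")
           .merge(df.alias("s"), condition)
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())
```

A primary notebook then becomes a thin "client" that only wires the pieces together, roughly:

```python
# Hypothetical primary notebook for one source table
from etl_common import (SilverTableConfig, read_incremental,
                        standard_transformations, merge_to_silver)

cfg = SilverTableConfig(source_path="/mnt/bronze/table1",   # illustrative path
                        target_table="silver.table1",
                        merge_keys=["id"])
df = standard_transformations(read_incremental(spark, cfg, since="2024-01-01"))
merge_to_silver(spark, df, cfg)
```

whereas the scripted approach would paste the bodies of these functions into every notebook.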

Question:
What is the industry best practice for data engineering on large-scale projects in Databricks: the scripted approach for simplicity, or the modular approach for long-term sustainability? Is there a clear favorite?

Please provide references to established best practices or official documentation, if such exists. Thank you!

2 REPLIES

szymon_dybczak
Contributor III

Hi @ashap551,

I would vote for the modular approach, which lets you reuse code and write unit tests in a simpler manner. For me, notebooks are only "clients" of these shared modules. You can take a look at the official documentation, where they follow a similar approach:
Software engineering best practices for notebooks | Databricks on AWS
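For instance, and purely as an illustrative sketch (assuming the shared functions live in a plain Python module like the hypothetical etl_common above, and that pyspark, delta-spark and pytest are available locally), you can unit test the transformation logic with pytest on a local SparkSession, without running a notebook at all:

```python
# test_etl_common.py -- hypothetical pytest test for a shared transformation
import pytest
from pyspark.sql import SparkSession

from etl_common import standard_transformations


@pytest.fixture(scope="session")
def spark():
    # A local SparkSession is enough to exercise pure DataFrame transformations.
    return SparkSession.builder.master("local[1]").getOrCreate()


def test_standard_transformations_deduplicates(spark):
    df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "val"])
    out = standard_transformations(df)
    assert out.count() == 2
```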

ashap551
New Contributor II

Thank you szymon_dybczak. I agree that it is software engineering best practice, and the documentation substantiates it.

I'm just wondering whether data engineering practices are starting to move away from functional, modular styles toward self-contained notebooks. A few of my colleagues find it very difficult to follow a modular coding style and strongly prefer to code everything in place in a single script / single notebook. More traditional data engineers, who are used to modular code, tend not to like rewriting code and prefer it the way you recommend here.

Wasn't sure if you came across the same pattern.

Just trying to keep up with industry trends myself!
