<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Best practices for code organization in large-scale Databricks ETL projects: Modular vs. Scripted in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/best-practices-for-code-organization-in-large-scale-databricks/m-p/99052#M39895</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/132713"&gt;@ashap551&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;I would vote for the modular approach, which lets you reuse code and write unit tests in a simpler manner. For me, notebooks are only "clients" of these shared modules. You can take a look at the official documentation, where they follow a similar approach:&lt;BR /&gt;&lt;A href="https://docs.databricks.com/en/notebooks/best-practices.html#step-3-move-code-into-a-shared-module" target="_blank" rel="noopener"&gt;Software engineering best practices for notebooks | Databricks on AWS&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Sun, 17 Nov 2024 17:54:18 GMT</pubDate>
    <dc:creator>szymon_dybczak</dc:creator>
    <dc:date>2024-11-17T17:54:18Z</dc:date>
    <item>
      <title>Best practices for code organization in large-scale Databricks ETL projects: Modular vs. Scripted</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-for-code-organization-in-large-scale-databricks/m-p/99037#M39894</link>
      <description>&lt;P&gt;I’m curious about&amp;nbsp;Data Engineering best practices for a large-scale data engineering project using Databricks to build a Lakehouse architecture (Bronze -&amp;gt; Silver -&amp;gt; Gold layers).&lt;/P&gt;&lt;P&gt;I’m currently comparing two approaches to writing the code for this solution and want to be sure which, if either, is considered the better approach:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;“Scripted” approach:&lt;/LI&gt;&lt;/OL&gt;&lt;UL&gt;&lt;LI&gt;Each notebook contains all operations, including common ones&lt;/LI&gt;&lt;LI&gt;Minimal use of functions, no classes&lt;/LI&gt;&lt;LI&gt;All code written out in each notebook for easy debugging&lt;/LI&gt;&lt;LI&gt;Table attributes declared as standalone variables using common naming conventions, but with no encapsulation (e.g., table1_silver_tablename, table1_processed_df, table2_silver_tablename, table2_processed_df… etc.)&lt;/LI&gt;&lt;/UL&gt;&lt;OL&gt;&lt;LI&gt;“Modular” approach:&lt;/LI&gt;&lt;/OL&gt;&lt;UL&gt;&lt;LI&gt;Common operations (e.g., environment setup, incremental reads, standard transformations, schema checks, Delta merges) stored in a shared codebase&lt;/LI&gt;&lt;LI&gt;Use of classes to encapsulate table attributes and operations&lt;/LI&gt;&lt;LI&gt;Custom transformations specific to each source kept separate&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Both approaches handle the same tasks, including:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Environment variable management&lt;/LI&gt;&lt;LI&gt;Incremental source reading&lt;/LI&gt;&lt;LI&gt;Standard transformations (e.g., file name parsing, deduplication)&lt;/LI&gt;&lt;LI&gt;Schema validation&lt;/LI&gt;&lt;LI&gt;Delta merging with insert/update date management&lt;/LI&gt;&lt;LI&gt;Checkpointing and metadata management&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;However, “Modular” puts these operations in a separate module (or notebook) that the primary notebooks call via an import, a magic command, or a dbutils function, whereas “Scripted” rewrites them in each notebook so that, for simplicity, everything stays self-contained.&lt;/P&gt;&lt;P&gt;Question:&lt;BR /&gt;&lt;STRONG&gt;What is the industry best practice for Data Engineering on large-scale projects in Databricks: a scripted approach for simplicity, or a modular approach for long-term sustainability? &amp;nbsp;Is there a clear favorite?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Please provide references to established best practices or official documentation, if such exists. Thank you!&lt;/P&gt;</description>
      <pubDate>Sun, 17 Nov 2024 10:01:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-for-code-organization-in-large-scale-databricks/m-p/99037#M39894</guid>
      <dc:creator>ashap551</dc:creator>
      <dc:date>2024-11-17T10:01:07Z</dc:date>
    </item>
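    <!-- Illustrative sketch of the "Modular" approach described in the post above, assuming
         PySpark on Databricks with Delta Lake. The module name, class name, column names,
         and paths are hypothetical placeholders, not taken from the thread.

    # shared/table_pipeline.py
    from pyspark.sql import DataFrame, SparkSession, functions as F
    from delta.tables import DeltaTable

    class BronzeToSilverTable:
        """Encapsulates one table's attributes plus the common operations
        (incremental read, standard transforms, Delta merge) shared by all sources."""

        def __init__(self, spark: SparkSession, source_path: str,
                     target_table: str, key_cols: list):
            self.spark = spark
            self.source_path = source_path
            self.target_table = target_table
            self.key_cols = key_cols

        def read_incremental(self, last_loaded_ts: str) -> DataFrame:
            # Incremental read: only rows newer than the stored watermark.
            return (self.spark.read.format("delta")
                    .load(self.source_path)
                    .where(F.col("ingest_ts") > F.lit(last_loaded_ts)))

        def standard_transforms(self, df: DataFrame) -> DataFrame:
            # Shared transforms: capture the source file name and deduplicate on the keys.
            return (df.withColumn("source_file", F.input_file_name())
                      .dropDuplicates(self.key_cols))

        def merge_to_silver(self, df: DataFrame) -> None:
            # Delta merge with insert/update date management.
            df = df.withColumn("updated_at", F.current_timestamp())
            target = DeltaTable.forName(self.spark, self.target_table)
            cond = " AND ".join(f"t.{c} = s.{c}" for c in self.key_cols)
            (target.alias("t").merge(df.alias("s"), cond)
                   .whenMatchedUpdateAll()
                   .whenNotMatchedInsertAll()
                   .execute())
    -->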
    <item>
      <title>Re: Best practices for code organization in large-scale Databricks ETL projects: Modular vs. Scripted</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-for-code-organization-in-large-scale-databricks/m-p/99052#M39895</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/132713"&gt;@ashap551&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;I would vote for the modular approach, which lets you reuse code and write unit tests in a simpler manner. For me, notebooks are only "clients" of these shared modules. You can take a look at the official documentation, where they follow a similar approach:&lt;BR /&gt;&lt;A href="https://docs.databricks.com/en/notebooks/best-practices.html#step-3-move-code-into-a-shared-module" target="_blank" rel="noopener"&gt;Software engineering best practices for notebooks | Databricks on AWS&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 17 Nov 2024 17:54:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-for-code-organization-in-large-scale-databricks/m-p/99052#M39895</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-11-17T17:54:18Z</dc:date>
    </item>
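    <!-- Illustrative sketch of the reply above: notebooks stay thin "clients" of the shared
         module, and the same module can be unit tested outside any notebook. The import
         path, table names, and watermark value are hypothetical and reuse the
         BronzeToSilverTable sketch shown earlier in this feed.

    # Databricks notebook source: notebooks/load_orders_silver
    from shared.table_pipeline import BronzeToSilverTable

    pipeline = BronzeToSilverTable(
        spark,                                   # `spark` is provided by the notebook runtime
        source_path="/Volumes/bronze/orders",
        target_table="main.silver.orders",
        key_cols=["order_id"],
    )
    df = pipeline.standard_transforms(pipeline.read_incremental("2024-11-01 00:00:00"))
    pipeline.merge_to_silver(df)

    # tests/test_table_pipeline.py: plain pytest against a local SparkSession,
    # which is what makes the modular layout easier to unit test.
    from pyspark.sql import SparkSession
    from shared.table_pipeline import BronzeToSilverTable

    def test_standard_transforms_deduplicates():
        spark = SparkSession.builder.master("local[1]").getOrCreate()
        pipeline = BronzeToSilverTable(spark, "unused", "unused", key_cols=["id"])
        df = spark.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "val"])
        assert pipeline.standard_transforms(df).count() == 2
    -->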
    <item>
      <title>Re: Best practices for code organization in large-scale Databricks ETL projects: Modular vs. Scripted</title>
      <link>https://community.databricks.com/t5/data-engineering/best-practices-for-code-organization-in-large-scale-databricks/m-p/99053#M39896</link>
      <description>&lt;P&gt;Thank you szymon_dybczak. &amp;nbsp;I agree that it is software engineering best practice, and the documentation substantiates it.&lt;/P&gt;&lt;P&gt;I’m just wondering whether newer data engineering practices are starting to move away from functional and modular styles, and whether there is a movement toward self-contained notebooks. &amp;nbsp;A few of my colleagues find it very difficult to follow a modular coding style and strongly prefer to code in place in a single script / single notebook. &amp;nbsp;More traditional data engineers, who are used to modular code, tend not to like rewriting code and prefer it the way you recommend here. &amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;I wasn’t sure if you had come across the same pattern. &amp;nbsp;&lt;/P&gt;&lt;P&gt;Just trying to keep up with industry trends myself!&lt;/P&gt;</description>
      <pubDate>Sun, 17 Nov 2024 18:48:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/best-practices-for-code-organization-in-large-scale-databricks/m-p/99053#M39896</guid>
      <dc:creator>ashap551</dc:creator>
      <dc:date>2024-11-17T18:48:28Z</dc:date>
    </item>
  </channel>
</rss>

