Rjt_de
Databricks Employee

In modern data-driven enterprises, data flows like lifeblood through complex systems and repositories to drive decision-making and innovation. Each dataset, whether structured or unstructured, holds the potential to unlock insights. However, transforming raw data into actionable intelligence is a complex challenge that demands a dynamic, adaptable framework capable of accelerating data ingestion and transformation.

In this three-part blog series, we will cover the metadata setup (part 1), orchestration (part 2), and deployment process (part 3) of a configuration/metadata-driven ETL framework in Databricks that follows the Lakehouse architecture.

 

Why do we need an ETL Framework?

An ETL framework is essential for efficiently managing data integration processes. It standardizes data extraction, transformation, and loading, ensuring consistency, scalability, and reliability in modern data platforms. 

While metadata-driven, automated ETL pipeline creation sounds appealing, it may introduce maintenance overhead and struggle to scale efficiently when the complexity of the transformation process varies widely.

What we need instead is a framework that is:

  • Consistent - provides a standard template that developers can use to build pipelines quickly.
  • Modular - built from reusable components shared across data pipelines, which reduces duplication of effort and makes maintenance easier.
  • Scalable - able to support ETL across different layers with varying complexity.
  • Auditable and traceable - provides audit trails for tracking job run execution and errors, which is critical for compliance, debugging, and monitoring.
  • Seamlessly integrated with CI/CD practices - should not add unnecessary burden to the CI/CD pipelines and should enable rapid deployment.

 

Significance of ETL Framework in Databricks

Having an ETL framework in Databricks is crucial for building a Lakehouse architecture, as it streamlines data integration, transformation, and management across the different data layers (bronze, silver, and gold) that are core components of the lakehouse. The framework uses a templated approach to build data assets across all these layers, accelerating data transformation and enrichment.

 

High-Level Design

Lakehouse architecture

A typical Lakehouse architecture consists of the following layers: 

  1. Landing Data Layer
  • Raw Data Storage: Data from various sources (structured and unstructured) is extracted and landed as-is from the source in cloud object storage.
  • Batch and Stream Ingestion: Data is ingested from this layer in batch or real time using various ingestion tools or Databricks’ native services (e.g., Auto Loader, Delta Live Tables).
  2. Raw Data Layer (Bronze)
  • Data Ingestion: Data is ingested from the landing data layer and stored as-is from the data source in Delta format.
  • Delta Lake: Ensures ACID transactions, schema enforcement, and unified handling of batch and streaming data.
  3. Curated Data Layer (Silver)
  • Data Transformation: Data is cleaned, enriched, and transformed within Databricks.
  • Data Quality Check: Data is checked against specific data quality rules to ensure quality standards are met.
  • Data Integration: Data from multiple sources can be integrated and consolidated for common use cases.
  4. Consumption Data Layer (Gold)
  • Data Modeling: Data is modelled for semantic consistency and read performance (e.g., dimensional modelling or denormalized tables).
  • SQL Analytics and BI Tools: Databricks SQL (DBSQL) or BI tools access the data in this layer for visualization or quick insights.

The ETL framework and the Lakehouse require governance, security control, and monitoring capabilities across all layers.  

  • Data Governance: Unity Catalog provides centralized governance, access control, and data lineage.
  • Monitoring and Optimization: Built-in tools for job monitoring, resource management, and performance optimization.

 

[Figure: End-to-end design of the ETL framework]

 

Metadata-Driven Framework Design

The framework has several key components responsible for building each layer of the lakehouse architecture.

  1. Control tables hold the configuration required for specific ETL pipelines. The structure and model of the control tables, and the approach of the execution notebooks, change based on the tasks in the different layers (Bronze, Silver, Gold).
  2. Execution scripts/notebooks use the configurations from the control tables to drive the execution (see the sketch after this list).
  3. Log tables, or job-run audit tables, provide complete audit and traceability of the job runs.
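The common thread across all layers is that each execution notebook looks up its active configuration row before doing any work. Below is a minimal sketch of that lookup, assuming a hypothetical etl_framework schema and the SCD type-2 columns described in the data model that follows; the names are illustrative, not part of the framework itself.

```python
# Minimal sketch of the shared config-lookup pattern (hypothetical names).
def get_active_config(control_table: str, key_column: str, key_value: str):
    """Return the currently active (SCD type-2) configuration row for one target table."""
    rows = (
        spark.table(f"etl_framework.{control_table}")
        .where("record_is_active = true")            # only the current version of each config
        .where(f"{key_column} = '{key_value}'")
        .collect()
    )
    if not rows:
        raise ValueError(f"No active config for {key_value} in {control_table}")
    return rows[0]

# Example: fetch the active bronze configuration for a hypothetical 'customer' feed
bronze_cfg = get_active_config("bronze_control_table", "bronze_table_name", "customer")
```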

Data Model

[Figure: Entity-relationship diagram of the framework's control and audit tables]

 

Bronze

This is the layer where data is ingested in its original form. There are three key components to build the backbone of the bronze layer loading process: Bronze Control Tables, Generic Python Modules to load the bronze layer tables, and the workflow job to orchestrate the pipeline. 

All the required configuration parameters are stored in the control table; the execution notebook reads these parameters and feeds them to the Auto Loader process that loads the bronze tables.
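As a rough illustration of how those parameters might feed Auto Loader, the sketch below reads the file format and location from the hypothetical control-table row fetched in the earlier lookup sketch and streams the files into the target bronze table; the checkpoint path convention is an assumption.

```python
# Hedged sketch of a control-table-driven Auto Loader load (illustrative paths/names).
def load_bronze_table(cfg):
    checkpoint = f"/Volumes/etl/checkpoints/{cfg.bronze_schema_name}/{cfg.bronze_table_name}"
    stream = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", cfg.source_file_type)   # e.g. json, csv, parquet
        .option("cloudFiles.schemaLocation", checkpoint)      # Auto Loader schema tracking
        .load(cfg.file_location)
    )
    return (
        stream.writeStream
        .option("checkpointLocation", checkpoint)
        .trigger(availableNow=True)                           # incremental, batch-style run
        .toTable(f"{cfg.bronze_schema_name}.{cfg.bronze_table_name}")
    )

cfg = get_active_config("bronze_control_table", "bronze_table_name", "customer")  # helper from the earlier sketch
load_bronze_table(cfg)
```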

 

BRONZE_CONTROL_TABLE 

This table holds the configuration details to load the bronze layer table; SCD type-2 is maintained by upserting new/updated records.

 

| Column | Description |
| --- | --- |
| id | Unique, incremental, autogenerated surrogate key |
| source_system | Source system name |
| bronze_table_name | Table name |
| bronze_schema_name | Bronze layer schema name |
| source_file_type | Format of the source/raw files |
| file_location | Full path of the source file location |
| config_file_name | YAML file used to add the configuration |
| record_start_ts | SCD type-2 record start timestamp |
| record_end_ts | SCD type-2 record end timestamp |
| record_is_active | SCD type-2 record active flag |
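For reference, a control table with these columns could be created along the following lines; the schema name is hypothetical, and a Delta identity column is one possible way to get the autogenerated surrogate key.

```python
# One possible DDL for the bronze control table (hypothetical schema name).
spark.sql("""
    CREATE TABLE IF NOT EXISTS etl_framework.bronze_control_table (
        id                 BIGINT GENERATED ALWAYS AS IDENTITY,  -- surrogate key
        source_system      STRING,
        bronze_table_name  STRING,
        bronze_schema_name STRING,
        source_file_type   STRING,
        file_location      STRING,
        config_file_name   STRING,
        record_start_ts    TIMESTAMP,
        record_end_ts      TIMESTAMP,
        record_is_active   BOOLEAN
    )
""")
```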

 

Generic Python execution modules

The execution notebook is invoked by different job tasks with specific parameters to load particular bronze tables. It is designed to be reusable and parameterized: a generic program that takes parameters from the workflow job tasks and uses modular utilities to fetch configuration details from the control tables and load the bronze tables with Auto Loader.
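A minimal sketch of that parameterization, assuming the job task passes the target table name as a notebook widget; the widget names are illustrative, and the helpers come from the earlier sketches.

```python
# Hypothetical entry point of the generic bronze execution notebook.
dbutils.widgets.text("source_system", "")
dbutils.widgets.text("bronze_table_name", "")

source_system = dbutils.widgets.get("source_system")
table_name = dbutils.widgets.get("bronze_table_name")

# Reuse the helpers sketched earlier: look up the active config, then load the table.
cfg = get_active_config("bronze_control_table", "bronze_table_name", table_name)
load_bronze_table(cfg)
```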

[Figure: Bronze layer execution flow]

 

Silver

In this layer, data quality checks and SCD type-1/type-2 loading are applied based on control table mappings. Records failing the checks are stored in quarantine tables, and depending on severity (e.g., "Drop" skips the record, while "Warning" allows it), they may or may not be loaded into the silver tables.

There are three key components to build the backbone of the Silver layer loading process: DQ Control Tables (DQ and Silver layer mapping), Generic Python Modules to do DQ checks and load the silver layer tables, and the workflow job to orchestrate the pipeline. 
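To make the severity handling concrete, here is a hedged sketch of how failed records might be split into clean and quarantine sets; the rule shapes and table names are assumptions for illustration, not the framework's actual module.

```python
# Minimal sketch of severity-aware DQ filtering and quarantine (illustrative names).
from pyspark.sql import functions as F

def apply_dq_rules(df, rules):
    """rules: list of dicts like {"rule_id": ..., "rule": <boolean SQL>, "severity": "Drop"|"Warning"}."""
    failed_drop = F.lit(False)
    for r in rules:
        passes = F.expr(r["rule"])                              # e.g. "customer_id IS NOT NULL"
        df = df.withColumn(f"dq_fail_{r['rule_id']}", ~passes)  # keep per-rule outcome for auditing
        if r["severity"] == "Drop":
            failed_drop = failed_drop | ~passes
    quarantine = df.filter(failed_drop)      # must not reach the silver table
    clean = df.filter(~failed_drop)          # 'Warning' failures still pass through
    return clean, quarantine

clean_df, quarantine_df = apply_dq_rules(
    spark.table("bronze.customer"),
    [
        {"rule_id": "R001", "rule": "customer_id IS NOT NULL", "severity": "Drop"},
        {"rule_id": "R002", "rule": "email LIKE '%@%'", "severity": "Warning"},
    ],
)
quarantine_df.write.mode("append").saveAsTable("silver.customer_quarantine")
```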

 

Silver layer control table structure

SILVER_CONTROL_TABLE 

This table holds the configuration details to load the silver layer tables; SCD type-2 is maintained by upserting new/updated records.

 

| Column | Description |
| --- | --- |
| id | Unique, incremental, autogenerated surrogate key |
| source_schema_name | Source table's schema name |
| table_name | Table name |
| silver_schema_name | Silver layer schema name |
| load_type | Approach used to load the silver layer table |
| config_file_name | YAML file used to add the configuration |
| record_start_ts | SCD type-2 record start timestamp |
| record_end_ts | SCD type-2 record end timestamp |
| record_is_active | SCD type-2 record active flag |



DQ_RULES

This table is the rule registry where all the rules are defined, along with their configurations. rule_id needs to be unique and must be explicitly declared when creating a new rule. SCD type-2 is maintained while updating or inserting rules.

| Column | Description |
| --- | --- |
| id | Unique, incremental, autogenerated surrogate key |
| rule_id | A unique ID explicitly declared in the configuration while creating new rules |
| rule_type | Type of the rule: sql or function |
| rule | For sql, the entire SQL clause; for function, just the function name |
| description | Rule description |
| config_file_name | YAML file used to add the configuration |
| record_start_ts | SCD type-2 record start timestamp |
| record_end_ts | SCD type-2 record end timestamp |
| record_is_active | SCD type-2 record active flag |
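The distinction between the two rule types might be resolved along these lines; the column templating and the function registry shown here are assumptions used for illustration.

```python
# Sketch of resolving a DQ_RULES row into a Boolean column expression (illustrative).
from pyspark.sql import functions as F

def not_null(column_name: str):
    return F.col(column_name).isNotNull()

FUNCTION_REGISTRY = {"not_null": not_null}   # hypothetical registry of reusable rule functions

def resolve_rule(rule_row, column_name: str):
    if rule_row.rule_type == "sql":
        # e.g. rule = "length({col}) <= 10", templated with the assigned column name
        return F.expr(rule_row.rule.format(col=column_name))
    if rule_row.rule_type == "function":
        return FUNCTION_REGISTRY[rule_row.rule](column_name)
    raise ValueError(f"Unknown rule_type: {rule_row.rule_type}")
```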

 

DQ_RULES_ASSIGNMENT

This table records the assignment of rules to particular columns of the tables. The id is automatically generated based on the combination of table, column, and rule_id. SCD type-2 is maintained while updating or inserting records.

| Column | Description |
| --- | --- |
| id | Unique, incremental, autogenerated surrogate key |
| table_name | Name of the table whose column is DQ-checked |
| column_name | Name of the column being DQ-checked |
| rule_id | rule_id of the rule from the DQ_RULES table |
| severity | Severity of a failed DQ check (e.g., Drop or Warning) |
| config_file_name | YAML file used to add the configuration |
| record_start_ts | SCD type-2 record start timestamp |
| record_end_ts | SCD type-2 record end timestamp |
| record_is_active | SCD type-2 record active flag |

 

Generic Python execution modules 

The execution notebook is invoked by different job tasks with specific parameters to load particular silver tables. It is a generic program that takes parameters from the workflow job tasks and uses modular utilities to fetch configuration details from the control tables, perform data quality checks, and load the silver tables with the appropriate loading approach.
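For the loading step itself, a hedged sketch of how load_type might drive an SCD type-1 merge is shown below; the business-key handling, load_type values, and table names are assumptions, and the SCD type-2 branch is only indicated.

```python
# Illustrative sketch of the silver load dispatched by load_type (hypothetical names).
from delta.tables import DeltaTable

def load_silver_table(clean_df, cfg, business_keys):
    target = f"{cfg.silver_schema_name}.{cfg.table_name}"
    if cfg.load_type == "scd1":
        condition = " AND ".join(f"t.{k} = s.{k}" for k in business_keys)
        (
            DeltaTable.forName(spark, target).alias("t")
            .merge(clean_df.alias("s"), condition)
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute()
        )
    elif cfg.load_type == "scd2":
        # Close the current version (set record_end_ts / record_is_active) and
        # insert the new one -- omitted here for brevity.
        ...

# clean_df comes from the DQ sketch above; get_active_config from the earlier lookup sketch.
load_silver_table(clean_df,
                  get_active_config("silver_control_table", "table_name", "customer"),
                  business_keys=["customer_id"])
```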

 

[Figure: Silver layer execution flow]

 

Gold

In this layer, the primary focus is on data transformation, where records are mapped to the correct attributes according to a specific data modelling methodology. In our case, we use dimensional modelling, building dimensions and facts as gold layer tables sourced from the silver layer tables.

This process also has three key components: Gold Control Tables, a Transformation notebook, and the workflow job to orchestrate the pipeline. 
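One way the orchestration might use the gold control table is to loop over the active rows and run the registered transformation notebooks; the notebook path convention, timeout, and arguments below are assumptions.

```python
# Sketch of dispatching gold transformation notebooks from the control table (illustrative).
gold_configs = (
    spark.table("etl_framework.gold_control_table")
    .where("record_is_active = true")
    .collect()
)

for cfg in gold_configs:
    # Each row points at a developer-written notebook that builds one dimension or fact table.
    dbutils.notebook.run(
        f"./gold_transformations/{cfg.notebook_name}",
        timeout_seconds=3600,
        arguments={"schema_name": cfg.schema_name, "table_name": cfg.table_name},
    )
```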

 

GOLD_CONTROL_TABLE 

This table holds the configuration details to load the gold layer tables; SCD type-2 is maintained by upserting new/updated records.

| Column | Description |
| --- | --- |
| id | Unique, incremental, autogenerated surrogate key |
| schema_name | Target table schema name |
| table_name | Target table name, built by the transformation |
| notebook_name | Transformation notebook name |
| config_file_name | YAML file used to add the configuration |
| record_start_timestamp | SCD type-2 record start timestamp |
| record_end_timestamp | SCD type-2 record end timestamp |
| record_is_active | SCD type-2 record active flag |

 

Generic Python execution modules

[Figure: Gold layer execution flow]

 

Audit Tables

Audit tables in an ETL framework are essential for tracking the success, failure, and performance of data processes, ensuring data integrity and transparency. They provide crucial logging information for troubleshooting and compliance, facilitating better monitoring and control of ETL workflows.

 

DQ_Run_Audit table

This audit table tracks and records the data quality check results of each batch in the silver layer during the data curation process.

| Column | Description |
| --- | --- |
| batch_id | Master workflow job run ID, which is a unique identifier |
| dq_task_run_id | DQ pipeline run ID |
| source_schema_name | Schema name of the table that goes through the DQ check |
| table_name | Name of the table that goes through the DQ check |
| target_schema_name | Target table schema name |
| pipeline_status | Data quality pipeline status |
| dq_check_outcome | Indicates whether any quarantine records were found |
| src_rec_count | Source table record count |
| quarantine_count | Quarantined record count |
| target_rec_count | Target table record count |
| dq_task_start_timestamp | Data quality check start timestamp |
| dq_task_end_timestamp | Data quality check end timestamp |



Job_Run_Audit table 

This table is for tracking the status of all job runs across different ETL layers. The table can be further modelled to capture granularity at the task level.

| Column | Description |
| --- | --- |
| job_id | Job ID of the workflow job |
| job_name | Name of the job |
| job_run_id | Workflow job run ID |
| layer | Data layer |
| job_status | Table load pipeline status |
| task_start_timestamp | Job/task run start timestamp |
| task_end_timestamp | Job/task run end timestamp |
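A hedged sketch of how a layer job might append its own row to this audit table is shown below; the schema name and the way the job and run IDs are obtained are assumptions.

```python
# Illustrative helper for appending one row to the job-run audit table.
from datetime import datetime, timezone

def write_job_audit(job_id, job_name, job_run_id, layer, job_status, start_ts, end_ts):
    spark.createDataFrame(
        [(job_id, job_name, job_run_id, layer, job_status, start_ts, end_ts)],
        "job_id STRING, job_name STRING, job_run_id STRING, layer STRING, "
        "job_status STRING, task_start_timestamp TIMESTAMP, task_end_timestamp TIMESTAMP",
    ).write.mode("append").saveAsTable("etl_framework.job_run_audit")

start_ts = datetime.now(timezone.utc)
# ... run the layer load ...
write_job_audit("123", "bronze_customer_load", "456", "bronze", "SUCCESS",
                start_ts, datetime.now(timezone.utc))
```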



User Journey

An ETL framework should prioritize developer efficiency and rapid integration of new features or data sources. In this framework, two personas are responsible for building the entire ETL process. Framework development involves building the foundation, which may evolve to add new capabilities such as DQ rules, enhanced auditing, or improved functionality. Application developers work in iterations to build new data pipelines: they ingest new metadata into the control tables, which is then used to build or update the bronze, silver, and gold layer pipelines, and they create transformation notebooks for the Gold layer.

[Figure: User journey overview]

 

Framework Developer’s responsibilities

  1. Create and maintain the generic Bronze/Silver/Gold loader notebooks.
  2. Create and maintain modular utility methods that support repetitive tasks in the framework, such as Auto Loader-based bronze loading, data quality checks, and SCD handling.
  3. Create and maintain the control table data model.
  4. Create and maintain pipelines that populate the control tables from YML files, which are used in the CI/CD pipeline (see the sketch after this list).
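A minimal sketch of such a population pipeline, assuming the YML files live in the repo and each file maps to one control table; a plain append stands in for the real SCD type-2 upsert, and the paths are hypothetical.

```python
# Illustrative control-table population from a YML file (hypothetical paths/names).
import yaml
from pyspark.sql import functions as F

def ingest_control_yaml(path: str, control_table: str):
    with open(path) as f:
        entries = yaml.safe_load(f)                 # expected: a list of config dictionaries
    incoming = spark.createDataFrame(entries).withColumn(
        "config_file_name", F.lit(path.split("/")[-1])
    )
    # The real pipeline would perform an SCD type-2 merge (close the previous
    # version, insert the new one); a plain append is shown for brevity.
    incoming.write.mode("append").saveAsTable(f"etl_framework.{control_table}")

ingest_control_yaml("configs/bronze_control/customer.yml", "bronze_control_table")
```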

[Figure: Framework developer journey]

 

Application Developer’s responsibilities

  1. Create individual YML files for bronze_control, silver_control, dq_rules_assignment, and gold_control; a separate pipeline ingests all these YML files into the control tables (an example entry follows this list).
  2. Create transformation notebooks for the Gold layer tables.
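For illustration, one possible shape for a bronze_control entry is shown below; the keys simply mirror the BRONZE_CONTROL_TABLE columns, and the values are hypothetical.

```python
# Hypothetical bronze_control YML entry, parsed here just to show its shape.
import yaml

bronze_control_yaml = """
- source_system: crm
  bronze_schema_name: bronze
  bronze_table_name: customer
  source_file_type: json
  file_location: /Volumes/landing/crm/customer/
"""
print(yaml.safe_load(bronze_control_yaml))
```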

[Figure: Application developer journey]

 

Conclusion

In this part, we covered the key factors to consider while designing the ETL framework, including the standard architecture, metadata model (the framework's backbone), and the developer's journey in building and maintaining it. While variations may exist to suit specific project needs, this serves as a reference architecture to guide the creation of a scalable, robust ETL framework.

In part 2, we'll explore orchestration and how it supports the framework-driven ETL architecture, seamlessly integrating with a standard CI/CD pipeline.

7 Comments
camrojas
New Contributor II

Excited to see the second part.

ArunPrasanthS
New Contributor II

Great stuff. Waiting for part 2

JesseS
New Contributor II

I'm eagerly awaiting the next part. I do have some questions that I hope will get addressed, or maybe someone can share some insights on.

1. Are the framework tables stored as Delta tables in the lakehouse? I envision a separate config schema that would house these tables, but I haven't worked with ingesting YAML files before, so I'm a bit confused on this point.

2. Will there be sample code showing a potential implementation of the generic execution notebooks?  I am very much interested in how the audit tables get populated at run time with the statistics.  We homebrewed an ETL monitoring system that relies on orchestration from Azure Data Factory, and I am hoping to be able to replace that with something more streamlined.

3. The article mentioned a separate pipeline to ingest the YAML files - is that planned for a future part of this series?  I've tried searching the community forums but haven't been able to find much information on this, so I hope it is covered.

Excellent article, and I am excited to see what I can learn from the future parts so that I can apply them to my ETL environment.

h_h_ak
Contributor

 

Thanks, it looks great!

Do you have any code or repository examples to share? We implemented a similar approach, and we also added some metadata, such as PII table identifiers and additional table properties, to our configuration.

AjayVerma
New Contributor

Nice article ... eagerly waiting for part-2

Mario_D
New Contributor II

I'm just going to repeat what was said before, good article .......really, eagerly waiting for the follow up.

No pressure 😀

szymon_dybczak
Contributor III

Great article. Can't wait for second part! 🙂