Rjt_de
Databricks Employee

In modern data-driven enterprises, data flows like lifeblood through complex systems and repositories, driving decision-making and innovation. Each dataset, whether structured or unstructured, holds the potential to unlock insights. However, transforming raw data into actionable intelligence is a complex challenge that demands a dynamic, adaptable framework capable of accelerating data ingestion and transformation.

In this three-part blog series, we will cover the metadata setup (part 1), orchestration (part 2), and deployment process (part 3) of a configuration/metadata-driven ETL framework in Databricks following the Lakehouse architecture.

 

Why do we need an ETL Framework?

An ETL framework is essential for efficiently managing data integration processes. It standardizes data extraction, transformation, and loading, ensuring consistency, scalability, and reliability in modern data platforms. 

While fully automated, metadata-driven ETL pipeline generation sounds appealing, it can introduce maintenance overhead and struggle to scale when the complexity of the transformations varies widely.

What we need instead is a framework that is:

  • Consistent - provides a consistent template that developers can use to build pipelines quickly.
  • Modular - built from modular components that can be reused across different data pipelines, which reduces duplication of effort and makes maintenance easier.
  • Scalable - able to support scalable ETL across different layers with varying complexity.
  • Auditable and traceable - provides audit trails for tracking job-run execution and errors, which is critical for compliance, debugging, and monitoring.
  • Seamlessly integrated with CI-CD practices - adds no unnecessary burden to the CI-CD pipelines and enables rapid deployment.

 

Significance of ETL Framework in Databricks

Having an ETL framework in Databricks is crucial for building a Lakehouse architecture, as it streamlines data integration, transformation, and management across the different data layers, such as the bronze, silver, and gold layers, which are core components of the lakehouse. The framework uses a templated approach to build data assets across all these layers to accelerate data transformation and enrichment.

 

High-Level Design

Lakehouse architecture

A typical Lakehouse architecture consists of the following layers: 

  1. Landing Data Layer
  • Raw Data Storage: Data from various sources (structured and unstructured) is extracted and landed as-is from the source in cloud object storage.
  • Batch and Stream Ingestion: Data is ingested from this layer in batch or in real time using various ingestion tools or Databricks’ native services (e.g., Auto Loader, DLT).
  2. Raw Data Layer (Bronze)
  • Data Ingestion: Data is ingested from the landing data layer and stored as-is from the data source in Delta format.
  • Delta Lake: Ensures ACID transactions, schema enforcement, and unified handling of batch and streaming data.
  3. Curated Data Layer (Silver)
  • Data Transformation: Data is cleaned, enriched, and transformed within Databricks.
  • Data Quality Check: Data is checked against specific data quality rules to ensure quality standards are met.
  • Data Integration: Data from multiple sources can be integrated and consolidated for common use cases.
  4. Consumption Data Layer (Gold)
  • Data Modelling: Data is modelled for semantic consistency and read performance (e.g., dimensional modelling or denormalized structures).
  • SQL Analytics and BI Tools: Databricks SQL or BI tools access the data in this layer for visualization or quick insights.

The ETL framework and the Lakehouse require governance, security control, and monitoring capabilities across all layers.  

  • Data Governance: Unity Catalog provides centralized governance, access control, and data lineage.
  • Monitoring and Optimization: Built-in tools for job monitoring, resource management, and performance optimization.

 

[Figure: ETL framework end-to-end design]

 

Metadata Driven Framework design

The framework can have multiple key components that are responsible for building each layer in the lakehouse architecture. 

  1. Control tables hold the configuration required for specific ETL pipelines. The structure and model of the control tables, and the approach of the execution notebooks, change based on the tasks in the different layers (Bronze, Silver, Gold).
  2. Execution scripts/notebooks leverage the configurations from the control tables to drive the execution.
  3. Log tables (the job-run audit tables) provide complete audit and traceability of the job runs; a simplified sketch of how these pieces fit together follows.
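
The sketch below shows one way this per-table execution path could be wired together: read the active configuration row, run the layer-specific load, and always write an audit record. All function and column names are assumptions for illustration, not the framework's actual API.

```python
# Simplified sketch of the per-table execution flow. `spark` is the notebook's
# SparkSession; names and control-table layout are illustrative assumptions.
from pyspark.sql import functions as F

def get_active_config(control_table: str, table_name: str):
    """Return the active (SCD type 2) configuration row for a given target table."""
    return (
        spark.table(control_table)
        .where((F.col("record_is_active") == True) & (F.col("table_name") == table_name))
        .first()
    )

def run_pipeline(control_table: str, table_name: str, load_fn, audit_fn):
    """Fetch config, run the layer-specific load, and always record an audit entry."""
    cfg = get_active_config(control_table, table_name)
    status = "SUCCESS"
    try:
        load_fn(cfg)                     # e.g. bronze Auto Loader load, silver DQ + SCD load
    except Exception:
        status = "FAILED"
        raise
    finally:
        audit_fn(table_name, status)     # append a row to the job-run audit table
```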

Data Model

[Figure: ETL framework control-table ER diagram]

 

Bronze

This is the layer where data is ingested in its original form. There are three key components to build the backbone of the bronze layer loading process: Bronze Control Tables, Generic Python Modules to load the bronze layer tables, and the workflow job to orchestrate the pipeline. 

All the required configuration parameters are stored in the control table, and the execution notebook passes these parameters to the Auto Loader process that loads the bronze tables.
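
For illustration, a trimmed-down version of such a config-driven Auto Loader load might look like the sketch below. The column names follow the BRONZE_CONTROL_TABLE described next; the checkpoint location convention is an assumption.

```python
# Sketch of a config-driven bronze load with Auto Loader (cloudFiles).
# `cfg` is one active row from BRONZE_CONTROL_TABLE; the checkpoint root is an
# assumed convention for this example.
def load_bronze_table(cfg, checkpoint_root="/Volumes/etl/checkpoints"):
    target = f"{cfg['bronze_schema_name']}.{cfg['bronze_table_name']}"
    checkpoint = f"{checkpoint_root}/{cfg['bronze_table_name']}"

    stream = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", cfg["source_file_type"])   # e.g. json, csv, parquet
        .option("cloudFiles.schemaLocation", checkpoint)
        .load(cfg["file_location"])
    )

    (
        stream.writeStream
        .option("checkpointLocation", checkpoint)
        .trigger(availableNow=True)        # incremental, batch-style run
        .toTable(target)
    )
```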

 

BRONZE_CONTROL_TABLE 

This table holds the configuration details used to load the bronze layer tables; SCD type 2 is maintained by upserting new/updated records.

 

| Column | Description |
|---|---|
| id | Unique, incremental, autogenerated surrogate key |
| source_system | Source system name |
| bronze_table_name | Bronze table name |
| bronze_schema_name | Bronze layer schema name |
| source_file_type | Format of the source/raw files |
| file_location | Full path of the source file location |
| config_file_name | YAML file used to add the configuration |
| record_start_ts | SCD type 2 record start timestamp |
| record_end_ts | SCD type 2 record end timestamp |
| record_is_active | SCD type 2 record active flag |
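
A minimal DDL sketch for this control table is shown below; the schema name (etl_config) is an assumption, and the IDENTITY column provides the autogenerated surrogate key.

```python
# Illustrative DDL for the bronze control table (schema name is an assumption).
spark.sql("""
CREATE TABLE IF NOT EXISTS etl_config.bronze_control_table (
  id                 BIGINT GENERATED ALWAYS AS IDENTITY,
  source_system      STRING,
  bronze_table_name  STRING,
  bronze_schema_name STRING,
  source_file_type   STRING,
  file_location      STRING,
  config_file_name   STRING,
  record_start_ts    TIMESTAMP,
  record_end_ts      TIMESTAMP,
  record_is_active   BOOLEAN
)
""")
```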

 

Generic Python execution modules

The execution notebook is invoked by various job tasks with specific parameters to access particular bronze tables. It is designed to be reusable and parameterized, serving as a generic program that takes parameters from the workflow job tasks and uses different modules to execute tasks such as fetching configuration details from the control tables and loading the bronze tables with Auto Loader.
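
A stripped-down entry point for that notebook might accept the target table as a task parameter and tie the pieces together; the widget name and control-table location are assumptions.

```python
# Sketch of the generic bronze execution notebook entry point.
# Widget name and control-table location are assumptions for this example.
dbutils.widgets.text("bronze_table_name", "")
table_name = dbutils.widgets.get("bronze_table_name")

cfg = (
    spark.table("etl_config.bronze_control_table")
    .where(f"record_is_active = true AND bronze_table_name = '{table_name}'")
    .first()
)

load_bronze_table(cfg)   # the generic Auto Loader module sketched earlier
```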

[Figure: Bronze layer execution flow]

 

Silver

In this layer, data quality checks and SCD1/2 loading are applied based on control table mappings. Records failing the checks are stored in quarantine tables, and depending on severity (e.g., "Drop" skips the record, while "Warning" allows it), they may or may not be loaded into the silver tables.
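
Conceptually, the DQ step splits each incoming batch into clean and quarantined records based on the assigned rules and their severity. A simplified sketch, with rule evaluation reduced to SQL predicates and all names assumed:

```python
from pyspark.sql import DataFrame, functions as F

def apply_dq_rules(df: DataFrame, rules: list) -> tuple:
    """Split a batch into (clean, quarantined) records.

    `rules` is a list of dicts like {"rule": "<sql predicate>", "severity": "Drop" | "Warning"},
    assembled from DQ_RULES and DQ_RULES_ASSIGNMENT (structure assumed for this sketch).
    """
    failed_any = F.lit(False)
    failed_any_drop = F.lit(False)
    for r in rules:
        failed = ~F.expr(r["rule"])                   # a rule fails when its predicate is false
        failed_any = failed_any | failed
        if r["severity"] == "Drop":
            failed_any_drop = failed_any_drop | failed

    quarantined = df.where(failed_any)                # everything that failed at least one rule
    clean = df.where(~failed_any_drop)                # "Warning" failures still flow to silver
    return clean, quarantined
```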

There are three key components to build the backbone of the Silver layer loading process: DQ Control Tables (DQ and Silver layer mapping), Generic Python Modules to do DQ checks and load the silver layer tables, and the workflow job to orchestrate the pipeline. 

 

Silver layer control table structure

SILVER_CONTROL_TABLE 

This table holds the configuration details used to load the silver layer tables; SCD type 2 is maintained by upserting new/updated records.

 

| Column | Description |
|---|---|
| id | Unique, incremental, autogenerated surrogate key |
| source_schema_name | Source table’s schema name |
| table_name | Table name |
| silver_schema_name | Silver layer schema name |
| load_type | The approach used to load the silver layer table |
| config_file_name | YAML file used to add the configuration |
| record_start_ts | SCD type 2 record start timestamp |
| record_end_ts | SCD type 2 record end timestamp |
| record_is_active | SCD type 2 record active flag |
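
When load_type calls for SCD type 2, the generic loader can expire changed records with a Delta MERGE before appending their new versions. A minimal sketch using the Delta Lake Python API (key and column names are assumptions):

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

def scd2_merge(updates_df, target_table: str, business_key: str):
    """Close out changed active rows (simplified SCD type 2 expiry step)."""
    target = DeltaTable.forName(spark, target_table)

    (
        target.alias("t")
        .merge(
            updates_df.alias("s"),
            f"t.{business_key} = s.{business_key} AND t.record_is_active = true",
        )
        # Expire the current version when an active match is found
        .whenMatchedUpdate(set={
            "record_is_active": F.lit(False),
            "record_end_ts": F.current_timestamp(),
        })
        # Insert brand-new keys as active records
        .whenNotMatchedInsertAll()
        .execute()
    )
    # Note: appending the *new version* of changed keys typically needs a second
    # insert step; omitted here to keep the sketch short.
```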



DQ_RULES - This table is used as the rule registry where all the rules are defined along with their configurations. rule_id must be unique and explicitly declared when creating a new rule. SCD type 2 is maintained while updating or inserting new rules.

| Column | Description |
|---|---|
| id | Unique, incremental, autogenerated surrogate key |
| rule_id | Unique id explicitly declared in the configuration when creating a new rule |
| rule_type | Type of the rule: sql or function |
| rule | For sql, the entire SQL clause; for function, just the function name |
| description | Rule description |
| config_file_name | YAML file used to add the configuration |
| record_start_ts | SCD type 2 record start timestamp |
| record_end_ts | SCD type 2 record end timestamp |
| record_is_active | SCD type 2 record active flag |
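
The rule_type column lets the executor treat a rule either as a SQL predicate or as the name of a registered Python function. A hedged sketch of that dispatch (the function registry and the column-templating convention are assumptions):

```python
from pyspark.sql import Column, functions as F

# Hypothetical registry for function-type rules: name -> callable returning a boolean Column.
FUNCTION_RULES = {
    "not_null": lambda col: F.col(col).isNotNull(),
    "is_positive": lambda col: F.col(col) > 0,
}

def rule_to_predicate(rule_row, column_name: str) -> Column:
    """Turn one DQ_RULES row into a boolean Column for the assigned column."""
    if rule_row["rule_type"] == "sql":
        # Assumes SQL rules are templated on the assigned column, e.g. "{col} IS NOT NULL".
        return F.expr(rule_row["rule"].format(col=column_name))
    return FUNCTION_RULES[rule_row["rule"]](column_name)
```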

 

DQ_RULES_ASSIGNMENT - This table records the assignment of rules to particular columns of the tables. The id is automatically generated based on the combination of the table, column, and rule_id (one possible scheme is sketched after the table). SCD type 2 is maintained while updating or inserting any new record.

| Column | Description |
|---|---|
| id | Unique, incremental, autogenerated surrogate key |
| table_name | Name of the table whose column is DQ checked |
| column_name | Name of the column being DQ checked |
| rule_id | rule_id of the rule from the DQ_RULES table |
| severity | Severity of a failed DQ check outcome (e.g., Drop or Warning) |
| config_file_name | YAML file used to add the configuration |
| record_start_ts | SCD type 2 record start timestamp |
| record_end_ts | SCD type 2 record end timestamp |
| record_is_active | SCD type 2 record active flag |
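
Because the id is derived from the table, column, and rule_id combination, one possible scheme is to hash that natural key when the assignments are ingested; a sketch (the hashing choice is an assumption):

```python
from pyspark.sql import functions as F

def add_assignment_id(assignments_df):
    """Derive a deterministic id from (table_name, column_name, rule_id); scheme assumed."""
    return assignments_df.withColumn(
        "id", F.sha2(F.concat_ws("||", "table_name", "column_name", "rule_id"), 256)
    )
```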

 

Generic Python execution modules 

The execution notebook is invoked by various job tasks with specific parameters to access particular silver tables. The execution notebook is a generic program that takes parameters from the workflow job tasks and uses different modules to execute tasks like getting configuration details from the control tables, performing data quality checks, and loading the silver tables with the right loading approach.
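
The loading step itself can simply dispatch on the load_type value from SILVER_CONTROL_TABLE; a small illustrative sketch (the load_type values and business key are assumptions):

```python
def load_silver_table(cfg, clean_df):
    """Dispatch to the configured loading approach for one silver table."""
    target = f"{cfg['silver_schema_name']}.{cfg['table_name']}"
    load_type = cfg["load_type"].lower()

    if load_type == "scd2":
        scd2_merge(clean_df, target, business_key="business_id")   # merge helper sketched earlier
    elif load_type == "scd1":
        clean_df.write.mode("overwrite").saveAsTable(target)       # simplified stand-in for an SCD1 upsert
    else:
        clean_df.write.mode("append").saveAsTable(target)
```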

 

[Figure: Silver layer execution flow]

 

Gold

In this layer, the primary focus is on data transformation, where records are mapped to the correct attributes according to a specific data modelling methodology. For instance, in our case, we are using dimensional modelling and building dimensions and facts as gold layer tables sourced from the silver layer tables.

This process also has three key components: Gold Control Tables, a Transformation notebook, and the workflow job to orchestrate the pipeline. 
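
An individual gold transformation notebook is intentionally free-form; as an illustration, a simple dimension build sourced from silver tables could look like the sketch below (all table and column names are invented for the example).

```python
from pyspark.sql import functions as F

# Illustrative gold transformation notebook: build a customer dimension from silver.
customers = spark.table("silver.customers")
regions = spark.table("silver.regions")

dim_customer = (
    customers.join(regions, "region_id", "left")
    .select(
        F.col("customer_id").alias("customer_key"),
        "customer_name",
        "region_name",
    )
    .dropDuplicates(["customer_key"])
)

dim_customer.write.mode("overwrite").saveAsTable("gold.dim_customer")
```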

 

GOLD_CONTROL_TABLE 

This table holds the configuration details used to load the gold layer tables; SCD type 2 is maintained by upserting new/updated records.

| Column | Description |
|---|---|
| id | Unique, incremental, autogenerated surrogate key |
| schema_name | Target table schema name |
| table_name | Target table name built by the transformation |
| notebook_name | Transformation notebook name |
| config_file_name | YAML file used to add the configuration |
| record_start_timestamp | SCD type 2 record start timestamp |
| record_end_timestamp | SCD type 2 record end timestamp |
| record_is_active | SCD type 2 record active flag |

 

Generic Python execution modules

[Figure: Gold layer execution flow]
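
One way the gold execution module can use the control table is to loop over its active rows and invoke each transformation notebook by notebook_name. A hedged sketch using dbutils.notebook.run follows; the repo path, timeout, and argument names are assumptions, and in this framework the workflow job covered in part 2 typically drives this instead.

```python
# Run each active gold transformation notebook listed in the control table.
# Repo path, timeout, and argument names are assumptions for this sketch.
gold_configs = (
    spark.table("etl_config.gold_control_table")
    .where("record_is_active = true")
    .collect()
)

for cfg in gold_configs:
    dbutils.notebook.run(
        f"/Repos/etl/gold/{cfg['notebook_name']}",                            # assumed repo layout
        3600,                                                                 # timeout in seconds
        {"schema_name": cfg["schema_name"], "table_name": cfg["table_name"]},
    )
```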

 

Audit Tables

Audit tables in an ETL framework are essential for tracking the success, failure, and performance of data processes, ensuring data integrity and transparency. They provide crucial logging information for troubleshooting and compliance, facilitating better monitoring and control of ETL workflows.

 

DQ_Run_Audit table

This audit table tracks and records the data quality check results for each batch in the silver layer during the data curation process.

| Column | Description |
|---|---|
| batch_id | Master workflow job run id, a unique identifier |
| dq_task_run_id | DQ pipeline run id |
| source_schema_name | Schema name of the table that goes through the DQ check |
| table_name | Name of the table that goes through the DQ check |
| target_schema_name | Target table schema name |
| pipeline_status | Data quality pipeline status |
| dq_check_outcome | Indicates whether any quarantined records were found |
| src_rec_count | Source table record count |
| quarantine_count | Quarantined record count |
| target_rec_count | Target table record count |
| dq_task_start_timestamp | Data quality check start timestamp |
| dq_task_end_timestamp | Data quality check end timestamp |
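
After each DQ run, the execution module can append a single summary row to this table; a minimal sketch (the audit schema location and the dq_check_outcome values are assumptions, and real code would pass the actual start and end timestamps):

```python
from pyspark.sql import functions as F

def write_dq_audit(batch_id, run_id, cfg, src_count, quarantine_count, target_count, status):
    """Append one summary row per DQ run to DQ_Run_Audit (table location assumed)."""
    row = [(
        batch_id, run_id, cfg["source_schema_name"], cfg["table_name"],
        cfg["silver_schema_name"], status,
        "QUARANTINED" if quarantine_count > 0 else "CLEAN",
        src_count, quarantine_count, target_count,
    )]
    cols = [
        "batch_id", "dq_task_run_id", "source_schema_name", "table_name",
        "target_schema_name", "pipeline_status", "dq_check_outcome",
        "src_rec_count", "quarantine_count", "target_rec_count",
    ]
    (
        spark.createDataFrame(row, cols)
        .withColumn("dq_task_start_timestamp", F.current_timestamp())   # simplified
        .withColumn("dq_task_end_timestamp", F.current_timestamp())     # simplified
        .write.mode("append").saveAsTable("etl_audit.dq_run_audit")
    )
```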



Job_Run_Audit table 

This table is for tracking the status of all job runs across different ETL layers. The table can be further modelled to capture granularity at the task level.

| Column | Description |
|---|---|
| job_id | Job id of the workflow job |
| job_name | Name of the job |
| job_run_id | Workflow job run id |
| layer | Data layer |
| job_status | Table load pipeline status |
| task_start_timestamp | Job/task run start timestamp |
| task_end_timestamp | Job/task run end timestamp |
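
job_id and job_run_id can be populated by passing Databricks dynamic value references such as {{job.id}} and {{job.run_id}} as task parameters and reading them in the notebook; a sketch (parameter names, status value, and audit table location are assumptions):

```python
from pyspark.sql import functions as F

# Task parameters configured in the workflow as {{job.id}} and {{job.run_id}}
# dynamic value references; the parameter names below are assumptions.
dbutils.widgets.text("job_id", "")
dbutils.widgets.text("job_run_id", "")

audit_row = spark.createDataFrame(
    [(dbutils.widgets.get("job_id"), "etl_framework_job",
      dbutils.widgets.get("job_run_id"), "bronze", "SUCCESS")],
    ["job_id", "job_name", "job_run_id", "layer", "job_status"],
)

(
    audit_row
    .withColumn("task_start_timestamp", F.current_timestamp())   # simplified; real code passes actual times
    .withColumn("task_end_timestamp", F.current_timestamp())
    .write.mode("append").saveAsTable("etl_audit.job_run_audit")
)
```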



User Journey

An ETL framework should prioritize developer efficiency and rapid integration of new features or data sources. In this framework, two personas are responsible for building the entire ETL process. Framework development involves building the foundation, which may evolve over time to add new capabilities such as DQ rules, richer auditing details, or improved functionality. Application developers work in multiple iterations to build new data pipelines: they ingest new metadata into the control tables, which is then used to build or update the bronze, silver, and gold layer pipelines, and they create transformation notebooks in the Gold layer.

[Figure: User journey overview]

 

Framework Developer’s responsibilities

  1. Create and maintain the generic Bronze/Silver/Gold loader notebooks.
  2. Create and maintain modular utility methods that support repetitive tasks in the framework, such as Auto Loader-based bronze loading, data quality checks, and SCD implementation.
  3. Create and maintain the control table data model.
  4. Create and maintain pipelines to populate the control tables from YML files, which will be used in the CI-CD pipeline (a minimal sketch follows this list).
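
A pared-down version of such a control-table ingestion pipeline might read each YML file and append its entries as active control records. The YAML structure, table names, and helper below are assumptions for illustration.

```python
# Sketch: ingest a YAML control file into a control table as new active records.
# Assumed YAML structure, e.g. bronze_control.yml:
#   tables:
#     - source_system: sales
#       bronze_schema_name: bronze
#       bronze_table_name: orders
#       source_file_type: json
#       file_location: /Volumes/raw/sales/orders
import yaml
from pyspark.sql import Row, functions as F

def ingest_control_file(path: str, control_table: str):
    """Load one YML config file and append its entries as active control records."""
    with open(path) as f:
        entries = yaml.safe_load(f)["tables"]

    (
        spark.createDataFrame([Row(**e) for e in entries])
        .withColumn("config_file_name", F.lit(path))
        .withColumn("record_start_ts", F.current_timestamp())
        .withColumn("record_end_ts", F.lit(None).cast("timestamp"))
        .withColumn("record_is_active", F.lit(True))
        .write.mode("append").saveAsTable(control_table)
    )
    # A production version would MERGE and expire superseded rows to keep SCD type 2.
```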

[Figure: Framework developer journey]

 

Application Developer’s responsibilities

  1. Create individual YML files for bronze_control, silver_control, dq_rules_assignment, and gold_control. A separate pipeline will ingest all these YML files into the control tables.
  2. Create transformation notebooks for the Gold layer tables.

[Figure: Application developer journey]

 

Conclusion

In this part, we covered the key factors to consider while designing the ETL framework, including the standard architecture, metadata model (the framework's backbone), and the developer's journey in building and maintaining it. While variations may exist to suit specific project needs, this serves as a reference architecture to guide the creation of a scalable, robust ETL framework.

In part 2, we'll explore orchestration and how it supports the framework-driven ETL architecture, seamlessly integrating with a standard CI/CD pipeline.