In modern data-driven enterprises, data flows like lifeblood through complex systems and repositories, powering decision-making and innovation. Each dataset, whether structured or unstructured, holds the potential to unlock insights. However, transforming raw data into actionable intelligence is a complex challenge that demands a dynamic, adaptable framework capable of accelerating data ingestion and transformation.
In this three-part blog series, we will cover the metadata setup (part 1), orchestration (part 2), and deployment process (part 3) of a configuration/metadata-driven ETL framework in Databricks following the Lakehouse architecture.
An ETL framework is essential for efficiently managing data integration processes. It standardizes data extraction, transformation, and loading, ensuring consistency, scalability, and reliability in modern data platforms.
While metadata-driven, automated ETL pipeline creation sounds appealing, it can introduce maintenance overhead and struggle to scale efficiently when the complexity of the transformation process varies widely.
What we need instead is a framework that balances standardization with flexibility: templated enough to accelerate pipeline creation, yet able to accommodate varying transformation complexity without heavy maintenance overhead.
Having an ETL framework in Databricks is crucial for building a Lakehouse architecture as it streamlines the data integration, transformation, and management in the different data layers, such as the bronze, silver, and gold layers, which are core components of the lakehouse. The framework uses a templated approach to build data assets across all these layers to accelerate data transformation and enrichment.
A typical Lakehouse architecture consists of the following layers: a bronze layer where data is ingested in its original form, a silver layer holding cleansed and validated data, and a gold layer containing curated, business-ready data modelled for consumption.
The ETL framework and the Lakehouse require governance, security control, and monitoring capabilities across all layers.
The framework can have multiple key components that are responsible for building each layer in the lakehouse architecture.
The bronze layer is where data is ingested in its original form. There are three key components that build the backbone of the bronze layer loading process: Bronze Control Tables, Generic Python Modules to load the bronze layer tables, and the workflow job to orchestrate the pipeline.
All the required configuration parameters are stored in the control table; the execution notebook reads these parameters and passes them to the Auto Loader process that loads the bronze tables.
The bronze control table holds the configuration details for loading the bronze layer tables; SCD type 2 is maintained by upserting new/updated records.
| Columns | Description |
|---|---|
| id | Unique and incremental autogenerated surrogate key |
| source_system | Source system name |
| bronze_table_name | Table name |
| bronze_schema_name | Bronze layer schema name |
| source_file_type | The format of the source/raw files |
| file_location | The full path of the source file location |
| config_file_name | YAML file used to add the configuration |
| record_start_ts | SCD type 2 record start datetime |
| record_end_ts | SCD type 2 record end datetime |
| record_is_active | SCD type 2 record active flag |
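Every control table row originates from a YAML configuration file (the config_file_name column) that the framework ingests. As a rough illustration of that metadata ingestion step, here is a minimal sketch; the YAML layout, file path, and catalog/schema name (`etl_config`) are hypothetical, and the SCD type 2 maintenance is simplified to a plain append.

```python
# Sketch: ingest a YAML configuration file into the bronze control table.
# The YAML layout, file path, catalog/schema ("etl_config"), and the plain
# append (instead of a full SCD type 2 merge) are simplifying assumptions.
import yaml
from pyspark.sql import functions as F

def load_bronze_config(spark, config_path="conf/bronze/sales_orders.yml"):
    with open(config_path) as f:
        cfg = yaml.safe_load(f)   # hypothetical layout: {"tables": [ {...}, ... ]}

    rows = [(e["source_system"], e["bronze_table_name"], e["bronze_schema_name"],
             e["source_file_type"], e["file_location"], config_path)
            for e in cfg["tables"]]

    incoming = spark.createDataFrame(
        rows,
        "source_system string, bronze_table_name string, bronze_schema_name string, "
        "source_file_type string, file_location string, config_file_name string")

    # In the real framework an SCD type 2 merge would expire changed rows first;
    # here we simply stamp the SCD columns and append the new active versions.
    (incoming
        .withColumn("record_start_ts", F.current_timestamp())
        .withColumn("record_end_ts", F.lit(None).cast("timestamp"))
        .withColumn("record_is_active", F.lit(True))
        .write.mode("append").saveAsTable("etl_config.bronze_control_table"))
```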
The execution notebook is invoked by various job tasks with specific parameters to access particular bronze tables. It is designed to be reusable and parameterized: a generic program that takes in parameters from the workflow job tasks and uses different modules to execute tasks such as fetching configuration details from the control tables and loading the bronze tables with Auto Loader.
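The snippet below is a minimal sketch of what such a generic execution notebook could look like: it reads the task parameters through notebook widgets, looks up the active configuration row, and starts an Auto Loader stream. The widget names, control table name, and checkpoint path convention are assumptions rather than the framework's actual implementation.

```python
# Sketch of a generic, parameterized bronze execution notebook (Databricks).
# `spark` and `dbutils` come from the notebook context; the widget names,
# control table name, and checkpoint path convention are hypothetical.

# 1. Parameters passed in from the workflow job task.
dbutils.widgets.text("source_system", "")
dbutils.widgets.text("bronze_table_name", "")
source_system = dbutils.widgets.get("source_system")
bronze_table_name = dbutils.widgets.get("bronze_table_name")

# 2. Fetch the active configuration row from the bronze control table.
config = (spark.table("etl_config.bronze_control_table")
          .filter(f"source_system = '{source_system}' "
                  f"AND bronze_table_name = '{bronze_table_name}' "
                  f"AND record_is_active = true")
          .first())

target_table = f"{config.bronze_schema_name}.{config.bronze_table_name}"
checkpoint = f"/Volumes/etl/_checkpoints/{config.bronze_schema_name}/{config.bronze_table_name}"

# 3. Incrementally ingest the raw files with Auto Loader.
stream = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", config.source_file_type)      # e.g. csv, json, parquet
          .option("cloudFiles.schemaLocation", f"{checkpoint}/schema")
          .load(config.file_location))

(stream.writeStream
    .option("checkpointLocation", checkpoint)
    .trigger(availableNow=True)   # process all available files, then stop
    .toTable(target_table))
```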
In the silver layer, data quality checks and SCD1/2 loading are applied based on control table mappings. Records failing the checks are stored in quarantine tables, and depending on severity (e.g., "Drop" skips the record, while "Warning" allows it), they may or may not be loaded into the silver tables.
There are three key components to build the backbone of the Silver layer loading process: DQ Control Tables (DQ and Silver layer mapping), Generic Python Modules to do DQ checks and load the silver layer tables, and the workflow job to orchestrate the pipeline.
The silver control table holds the configuration details for loading the silver layer tables; SCD type 2 is maintained by upserting new/updated records.
| Columns | Description |
|---|---|
| id | Unique and incremental autogenerated surrogate key |
| source_schema_name | Source table's schema name |
| table_name | Table name |
| silver_schema_name | Silver layer schema name |
| load_type | The approach to loading the silver layer tables |
| config_file_name | YAML file used to add the configuration |
| record_start_ts | SCD type 2 record start datetime |
| record_end_ts | SCD type 2 record end datetime |
| record_is_active | SCD type 2 record active flag |
DQ_RULES - This table is used as the rule registry where all the rules are defined along with their configurations. The rule_id needs to be unique and explicitly declared when creating a new rule. SCD type 2 is maintained when updating or inserting rules.
| Columns | Description |
|---|---|
| id | Unique and incremental autogenerated surrogate key |
| rule_id | A unique id explicitly declared in the configuration while creating new rules |
| rule_type | Type of the rule: sql or function |
| rule | For sql, the entire SQL clause; for function, just the function name |
| description | Rule description |
| config_file_name | YAML file used to add the configuration |
| record_start_ts | SCD type 2 record start datetime |
| record_end_ts | SCD type 2 record end datetime |
| record_is_active | SCD type 2 record active flag |
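As an illustration of how the two rule_type values might be interpreted at run time, the sketch below evaluates a sql-type rule as a Spark SQL expression and resolves a function-type rule through a small Python registry. The example rules and the registry are hypothetical, not the framework's actual rule set.

```python
# Sketch: interpreting DQ_RULES entries at run time.
# rule_type = "sql": `rule` holds a full SQL clause evaluated per row.
# rule_type = "function": `rule` holds the name of a registered Python function.
# The example rules and the registry below are hypothetical.
from pyspark.sql import Column, DataFrame
from pyspark.sql import functions as F

# Hypothetical registry for function-type rules: name -> callable(column_name) -> boolean Column
FUNCTION_RULES = {
    "is_not_null": lambda c: F.col(c).isNotNull(),
    "is_positive": lambda c: F.col(c) > 0,
}

def rule_predicate(rule_type: str, rule: str, column_name: str) -> Column:
    """Return a boolean Column that is True when the rule passes for the row."""
    if rule_type == "sql":
        # e.g. rule = "email RLIKE '^[^@]+@[^@]+$'" -- the whole SQL clause is stored
        return F.expr(rule)
    if rule_type == "function":
        # e.g. rule = "is_not_null" -- only the function name is stored
        return FUNCTION_RULES[rule](column_name)
    raise ValueError(f"Unknown rule_type: {rule_type}")

def flag_failures(df: DataFrame, rule_type: str, rule: str, column_name: str, rule_id: str) -> DataFrame:
    """Add a per-rule pass/fail flag column such as dq_pass_<rule_id>."""
    return df.withColumn(f"dq_pass_{rule_id}", rule_predicate(rule_type, rule, column_name))
```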
DQ_RULES_ASSIGNMENT - This table records the assignment of rules to particular columns of the tables. The id is generated automatically from the combination of the table, column, and rule_id. SCD type 2 is maintained when updating or inserting records.
| Columns | Description |
|---|---|
| id | Unique and incremental autogenerated surrogate key |
| table_name | Name of the table whose column is getting DQ checked |
| column_name | Name of the column which is getting DQ checked |
| rule_id | rule_id of the rule from the DQ_RULES table |
| severity | Severity of the DQ check failure outcome (e.g., Drop or Warning) |
| config_file_name | YAML file used to add the configuration |
| record_start_ts | SCD type 2 record start datetime |
| record_end_ts | SCD type 2 record end datetime |
| record_is_active | SCD type 2 record active flag |
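Building on the previous sketch, the following hedged example shows how the rule assignments and their severity could drive the split between quarantine and silver-bound records, assuming the Drop/Warning scheme described earlier; the control table names and the `rule_predicate` helper from the previous sketch are assumptions.

```python
# Sketch: apply the rules assigned to a table and route failures by severity.
# Reuses the hypothetical rule_predicate() helper from the previous sketch;
# the control table names and severity values are illustrative only.
from pyspark.sql import functions as F

def apply_dq_checks(spark, df, table_name):
    """Return (clean_df, quarantine_df) based on DQ_RULES_ASSIGNMENT severities."""
    assignments = (spark.table("etl_config.dq_rules_assignment").alias("a")
                   .join(spark.table("etl_config.dq_rules").alias("r"), "rule_id")
                   .filter(f"a.table_name = '{table_name}' AND a.record_is_active = true")
                   .select("a.column_name", "a.severity", "r.rule_type", "r.rule")
                   .collect())

    drop_fail = F.lit(False)   # row failed at least one rule with severity 'Drop'
    warn_fail = F.lit(False)   # row failed at least one rule with severity 'Warning'
    for a in assignments:
        failed = ~rule_predicate(a.rule_type, a.rule, a.column_name)
        if a.severity == "Drop":
            drop_fail = drop_fail | failed
        else:
            warn_fail = warn_fail | failed

    flagged = (df.withColumn("dq_drop_fail", drop_fail)
                 .withColumn("dq_warn_fail", warn_fail))

    # Every failing record is quarantined; only 'Drop' failures are withheld from silver.
    quarantine_df = flagged.filter("dq_drop_fail OR dq_warn_fail")
    clean_df = flagged.filter("NOT dq_drop_fail")   # 'Warning' records still flow to silver
    return clean_df, quarantine_df
```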
The execution notebook is invoked by various job tasks with specific parameters to access particular silver tables. It is a generic program that takes in parameters from the workflow job tasks and uses different modules to execute tasks such as fetching configuration details from the control tables, performing data quality checks, and loading the silver tables with the right loading approach.
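As an example of how the load_type from the silver control table might select the loading approach, here is a hedged sketch that dispatches between a full refresh, an SCD type 1 merge, and a simplified SCD type 2 merge using Delta Lake; the load_type values and the key-column convention are assumptions for illustration.

```python
# Sketch: choose the silver loading approach from the control table's load_type.
# The load_type values ("full", "scd1", "scd2") and the key columns passed in
# are hypothetical conventions, not the framework's actual ones.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

def load_silver(spark, clean_df, target_table, load_type, key_cols):
    on_clause = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)

    if load_type == "full":
        # Full refresh: replace the table contents.
        clean_df.write.mode("overwrite").saveAsTable(target_table)

    elif load_type == "scd1":
        # SCD type 1: update matching rows in place, insert new ones (no history).
        (DeltaTable.forName(spark, target_table).alias("t")
            .merge(clean_df.alias("s"), on_clause)
            .whenMatchedUpdateAll()
            .whenNotMatchedInsertAll()
            .execute())

    elif load_type == "scd2":
        # SCD type 2 (simplified: no change detection): expire the current
        # versions of matching keys, then append the incoming rows as active.
        (DeltaTable.forName(spark, target_table).alias("t")
            .merge(clean_df.alias("s"), on_clause + " AND t.record_is_active = true")
            .whenMatchedUpdate(set={
                "record_end_ts": F.current_timestamp(),
                "record_is_active": F.lit(False)})
            .execute())
        (clean_df
            .withColumn("record_start_ts", F.current_timestamp())
            .withColumn("record_end_ts", F.lit(None).cast("timestamp"))
            .withColumn("record_is_active", F.lit(True))
            .write.mode("append").saveAsTable(target_table))
```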
In the gold layer, the primary focus is on data transformation, where records are mapped to the correct attributes according to a specific data modelling methodology. For instance, in our case we use dimensional modelling, building dimensions and facts as gold layer tables sourced from the silver layer tables.
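For context, a transformation notebook in this layer might look like the minimal sketch below, which builds a simple customer dimension from silver tables; the table names, columns, and surrogate-key approach are purely illustrative assumptions.

```python
# Sketch: a gold-layer transformation notebook building a customer dimension.
# The silver/gold table names, columns, and surrogate-key approach are
# illustrative assumptions; `spark` comes from the notebook context.
from pyspark.sql import functions as F

customers = spark.table("silver.customers").filter("record_is_active = true")
regions = spark.table("silver.regions")

dim_customer = (customers.alias("c")
    .join(regions.alias("r"), "region_id", "left")
    .select(
        F.xxhash64(F.col("c.customer_id")).alias("customer_sk"),   # illustrative surrogate key
        F.col("c.customer_id"),
        F.initcap(F.col("c.customer_name")).alias("customer_name"),
        F.col("r.region_name"),
        F.current_timestamp().alias("load_ts"),
    ))

dim_customer.write.mode("overwrite").saveAsTable("gold.dim_customer")
```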
This process also has three key components: Gold Control Tables, a Transformation notebook, and the workflow job to orchestrate the pipeline.
The gold control table holds the configuration details for loading the gold layer tables; SCD type 2 is maintained by upserting new/updated records.
| Columns | Description |
|---|---|
| id | Unique and incremental autogenerated surrogate key |
| schema_name | Target table schema name |
| table_name | Target table name which is built from the transformation |
| notebook_name | Transformation notebook |
| config_file_name | YAML file used to add the configuration |
| record_start_timestamp | SCD type 2 record start datetime |
| record_end_timestamp | SCD type 2 record end datetime |
| record_is_active | SCD type 2 record active flag |
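To show how the gold control table could drive orchestration, here is a hedged sketch that runs the registered transformation notebook for each active entry; the notebook path prefix and parameter names are assumptions, not the framework's actual conventions.

```python
# Sketch: run the registered transformation notebook for every active entry
# in the gold control table. The notebook path prefix and parameter names
# are hypothetical; `spark` and `dbutils` come from the notebook context.
gold_configs = (spark.table("etl_config.gold_control_table")
                .filter("record_is_active = true")
                .collect())

for cfg in gold_configs:
    dbutils.notebook.run(
        f"./transformations/{cfg.notebook_name}",                       # transformation notebook
        3600,                                                           # timeout in seconds
        {"schema_name": cfg.schema_name, "table_name": cfg.table_name}  # task parameters
    )
```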
Audit tables in an ETL framework are essential for tracking the success, failure, and performance of data processes, ensuring data integrity and transparency. They provide crucial logging information for troubleshooting and compliance, facilitating better monitoring and control of ETL workflows.
The DQ audit table tracks and records the data quality check results of each batch in the silver layer during the data curation process.
| Columns | Description |
|---|---|
| batch_id | Master workflow job run id, which is a unique identifier |
| dq_task_run_id | DQ pipeline run id |
| source_schema_name | Schema name of the table that undergoes the DQ check |
| table_name | Name of the table that undergoes the DQ check |
| target_schema_name | Target table schema name |
| pipeline_status | Data quality pipeline status |
| dq_check_outcome | Indicates whether any quarantine records were found |
| src_rec_count | Source table record count |
| quarantine_count | Quarantined record count |
| target_rec_count | Target table record count |
| dq_task_start_timestamp | Data quality check start timestamp |
| dq_task_end_timestamp | Data quality check end timestamp |
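As an illustration, an audit record for a DQ run could be appended with a small helper like the hedged sketch below; the audit table name and the way counts and timestamps are gathered by the caller are assumptions.

```python
# Sketch: append one audit record per DQ batch run.
# The audit table name ("etl_audit.dq_audit") and the way counts/timestamps
# are gathered by the caller are assumptions for illustration.
from datetime import datetime, timezone
from pyspark.sql import Row

def write_dq_audit(spark, batch_id, dq_task_run_id, source_schema_name, table_name,
                   target_schema_name, pipeline_status, src_rec_count,
                   quarantine_count, target_rec_count, started_at):
    record = Row(
        batch_id=batch_id,
        dq_task_run_id=dq_task_run_id,
        source_schema_name=source_schema_name,
        table_name=table_name,
        target_schema_name=target_schema_name,
        pipeline_status=pipeline_status,
        dq_check_outcome="QUARANTINED" if quarantine_count > 0 else "CLEAN",
        src_rec_count=src_rec_count,
        quarantine_count=quarantine_count,
        target_rec_count=target_rec_count,
        dq_task_start_timestamp=started_at,
        dq_task_end_timestamp=datetime.now(timezone.utc),
    )
    spark.createDataFrame([record]).write.mode("append").saveAsTable("etl_audit.dq_audit")
```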
The job run audit table tracks the status of all job runs across the different ETL layers. The table can be further modelled to capture granularity at the task level.
| Columns | Description |
|---|---|
| job_id | Job id of the workflow job |
| job_name | Name of the job |
| job_run_id | Workflow job run id |
| layer | Data layer |
| job_status | Table load pipeline status |
| task_start_timestamp | Task start timestamp |
| task_end_timestamp | Task end timestamp |
An ETL framework should prioritize developer efficiency and rapid integration of new features or data sources. In this framework, two personas are responsible for building the entire ETL process. The framework developer builds the foundation, which may evolve to add new capabilities such as DQ rules, richer auditing details, or improved functionality. The application developers work in multiple iterations to build new data pipelines: they ingest new metadata into the control tables, which is then used to build or update the bronze, silver, and gold layer pipelines, and they create the transformation notebooks in the gold layer.
In this part, we covered the key factors to consider while designing the ETL framework, including the standard architecture, metadata model (the framework's backbone), and the developer's journey in building and maintaining it. While variations may exist to suit specific project needs, this serves as a reference architecture to guide the creation of a scalable, robust ETL framework.
In part 2, we'll explore orchestration and how it supports the framework-driven ETL architecture, seamlessly integrating with a standard CI/CD pipeline.