- Subscribe to RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Printer Friendly Page
- Report Inappropriate Content
In modern data-driven enterprises, data flows like lifeblood through complex systems and repositories to drive decision-making and innovation. Each dataset, whether structured or unstructured, holds the potential to unlock insights and drive innovation. However, transforming raw data into actionable intelligence is a complex challenge that demands a dynamic, adaptable framework capable of accelerating data ingestion and transformation.
In this three-part blog, we will cover the metadata setup (part 1), orchestration (part 2), and deployment process (part 3) of a configuration/metadata-driven ETL framework in Databricks following the Lakehouse architecture.
Why do we need an ETL Framework?
An ETL framework is essential for efficiently managing data integration processes. It standardizes data extraction, transformation, and loading, ensuring consistency, scalability, and reliability in modern data platforms.
While metadata-driven, automated ETL pipeline creation sounds appealing, it may introduce maintenance overhead and struggle to scale efficiently when the complexity of the transformation process varies too much.
What we need instead is a framework that is
- Consistent - has a consistent template that developers can utilize to build pipelines quickly.
- Modular - with modular components that can be reused across different data pipelines, which reduces duplication of effort and makes maintenance easier.
- Scalable - able to support scalable ETL of different layers with various complexities.
- Auditable and traceable—provides audit trails for tracking job run execution and errors, which is critical for compliance, debugging, and monitoring.
- Seamlessly integrated with the CI-CD practices - should not add unnecessary burden to the CI-CD pipelines and ensures rapid deployment.
Significance of ETL Framework in Databricks
Having an ETL framework in Databricks is crucial for building a Lakehouse architecture as it streamlines the data integration, transformation, and management in the different data layers, such as the bronze, silver, and gold layers, which are core components of the lakehouse. The framework uses a templated approach to build data assets across all these layers to accelerate data transformation and enrichment.
High-Level Design
Lakehouse architecture
A typical Lakehouse architecture consists of the following layers:
- Landing Data Layer
- Raw Data Storage: Data from various sources (structured and unstructured) is extracted and landed as they were from the source in a cloud object storage.
- Batch and Stream Ingestion: Data will be ingested from this layer in batch or real-time using various ingestion tools or Databricks’ native services (e.g., Autoloader, DLT, etc.)
- Raw Data Layer (Bronze)
- Data Ingestion: Data will be ingested from the landing data layer and stored as of the data source with delta format
- Delta Lake: Ensures ACID transactions, schema enforcement, and unified handling of batch and streaming data
- Curated Data Layer (Silver)
- Data Transformation: Data is cleaned, enriched, and transformed within Databricks.
- Data Quality Check: Data will be checked against specific data quality rules to ensure quality standards are met
- Data Integration: the data from multiple sources could be integrated and consolidated for common use cases
- Consumption Data Layer (Gold)
- Data Modeling: Data are modelled for semantic consistency and read performance (e.g. Dimensional Modeling or denormalized)
- SQL Analytics and BI Tools: Databricks DBSQL or BI tools will access the data in layers for visualization or quick insights
The ETL framework and the Lakehouse require governance, security control, and monitoring capabilities across all layers.
- Data Governance: Unity Catalog provides centralized governance, access control, and data lineage.
- Monitoring and Optimization: Built-in tools for job monitoring, resource management, and performance optimization.
Metadata Driven Framework design
The framework can have multiple key components that are responsible for building each layer in the lakehouse architecture.
- Control tables are the configuration required for specific ETL pipelines. The structure and model of the control tables and the approach of the execution notebooks change based on the tasks in the different layers (Bronze, Silver, Gold).
- Execution scripts/notebooks leverage the configurations from the control tables to the execution
- Log tables or the job-run audit tables provide complete audit and traceability of the job runs.
Data Model
Bronze
This is the layer where data is ingested in its original form. There are three key components to build the backbone of the bronze layer loading process: Bronze Control Tables, Generic Python Modules to load the bronze layer tables, and the workflow job to orchestrate the pipeline.
All the required configuration parameters are stored in the control table, and in the execution notebook, these parameters are passed to feed the autoloader process to load the bronze tables.
BRONZE_CONTROL_TABLE
This table holds the configuration details to load the bronze layer table; SCD type-2 is maintained by upserting new/updated records.
Columns |
Description |
id |
Unique and incremental autogenerated surrogate key |
source_system |
Source system name |
bronze_table_name |
Table name |
bronze_schema_name |
Bronze layer schema name |
source_file_type |
The format of the source/raw files |
file_location |
The full path of the source file location |
config_file_name |
yaml file which is used to add the configuration |
record_start_ts |
scd type2 record start date time |
record_end_ts |
scd type2 record end date time |
record_is_active |
scd type2 record active flag |
Generic Python execution modules
The execution notebook is invoked by various job tasks with specific parameters to access particular bronze tables. The execution notebook is designed to be reused and parameterized and serves as a generic program that can take in parameters from the workflow job tasks and use different modules to execute tasks like getting configuration details from the control tables and loading the bronze tables with auto-loader
Silver
In this layer, data quality checks and SCD1/2 loading are applied based on control table mappings. Records failing the checks are stored in quarantine tables, and depending on severity (e.g., "Drop" skips the record, while "Warning" allows it), they may or may not be loaded into the silver tables.
There are three key components to build the backbone of the Silver layer loading process: DQ Control Tables (DQ and Silver layer mapping), Generic Python Modules to do DQ checks and load the silver layer tables, and the workflow job to orchestrate the pipeline.
Silver layer control table structure
SILVER_CONTROL_TABLE
This table holds the configuration detail to load the silver layer tables, SCD type2 is maintained by upserting new/updated records.
Columns |
Description |
---|---|
id |
Unique and incremental autogenerated surrogate key |
source_schema_name |
source table’s schema name |
table_name |
table name |
silver_schema_name |
Silver layer schema name |
load_type |
The approach to loading the silver layer tables |
config_file_name |
yaml file which is used to add the configuration |
record_start_ts |
scd type2 record start date time |
record_end_ts |
scd type2 record end date time |
record_is_active |
scd type2 record active flag |
DQ_RULES - This table is used as the rule registry where all the rules are defined along with the different configurations. Rule_ID needs to be unique and explicitly declared while creating new rule. SCD type2 is maintained while updating or inserting any new rules.
Columns |
Description |
---|---|
id |
Unique and incremental autogenerated surrogate key |
rule_id |
a unique id explicitly declared in the configuration while creating new rules |
rule_type |
type of the rules -sql or function |
rule |
for sql it’s the entire sql clause, for function just the function name |
description |
rule description |
config_file_name |
yaml file which is used to add the configuration |
record_start_ts |
scd type2 record start date time |
record_end_ts |
scd type2 record end date time |
record_is_active |
scd type2 record active flag |
DQ_RULES_ASSIGNMENT - This table records the rule assignment to the particular columns of the tables. The ID is automatically generated based on the combination of the table, column and the rule_id. SCD type2 is maintained while updating or inserting any new record.
Columns |
Description |
---|---|
id |
Unique and incremental autogenerated surrogate key |
table_name |
Name of the table whose column is getting dq checked |
column_name |
Name of the column which is getting dq checked |
rule_id |
rule_id of the rule from the DQ_RULES table |
severity |
Type of the severity of the technical DQ check fail outcome. |
config_file_name |
yaml file which is used to add the configuration |
record_start_ts |
scd type2 record start date time |
record_end_ts |
scd type2 record end date time |
record_is_active |
scd type2 record active flag |
Generic Python execution modules
The execution notebook is invoked by various job tasks with specific parameters to access particular silver tables. The execution notebook is a generic program that can take in parameters from the workflow job tasks and use different modules to execute tasks like getting configuration detail from the control tables, performing data quality checks, and loading silver tables with the right loading approach.
Gold
In this layer, the primary focus is on data transformation, where records are mapped to the correct attributes according to a specific data modelling methodology. For instance, in our case, we are using dimensional modelling, and building dimensions and facts as the gold layer tables sourcing from the silver layer tables.
This process also has three key components: Gold Control Tables, a Transformation notebook, and the workflow job to orchestrate the pipeline.
GOLD_CONTROL_TABLE
This table holds the configuration detail to load the gold layer tables, SCD type2 is maintained by upserting new/updated records
Columns |
Description |
---|---|
id |
Unique and incremental autogenerated surrogate key |
schema_name |
target table schema name |
table_name |
target table name which is built from the transformation |
notebook_name |
Transformation notebook |
config_file_name |
yaml file which is used to add the configuration |
record_start_timestamp |
scd type2 record start date time |
record_end_timestamp |
scd type2 record end date time |
record_is_active |
scd type2 record active flag |
Generic Python execution modules
Audit Tables
Audit tables in an ETL framework are essential for tracking the success, failure, and performance of data processes, ensuring data integrity and transparency. They provide crucial logging information for troubleshooting and compliance, facilitating better monitoring and control of ETL workflows
DQ_Run_Audit table -
This audit table is to track and record all the data quality check results of each batch in the silver layer during the data curation process-
Columns |
Description |
---|---|
batch_id |
Master workflow job run id which is a unique identifier |
dq_task_run_id |
DQ pipeline run id |
source_schema_name |
table schema name which goes under the DQ check |
table_name |
table name which goes under the DQ check |
target_schema_name |
target table schema name |
pipeline_status |
data quality pipeline status |
dq_check_outcome |
to identify any quarantine records found or not |
src_rec_count |
source table record count |
quarantine_count |
quarantined record count |
target_rec_count |
target table record count |
dq_task_start_timestamp |
data quality check start timestamp |
dq_task_end_timestamp |
data quality check end timestamp |
Job_Run_Audit table
This table is for tracking the status of all job runs across different ETL layers. The table can be further modelled to capture granularity at the task level.
Columns |
Description |
---|---|
job_id |
Job id of the workflow jobs |
job_name |
name of the job |
job_run_id |
Workflow job run id |
layer |
Data layer |
job_status |
Table load pipeline status |
task_start_timestamp |
data quality check start timestamp |
task_end_timestamp |
data quality check end timestamp |
User Journey
An ETL framework should prioritize developer efficiency and rapid integration of new features or data sources. In this framework, there are two personas responsible for building the entire ETL process. Framework development involves building the foundation, which may undergo changes to add new capabilities like DQ rules, enhance auditing details, or improve functionality. The application developers work on multiple iterations to build new data pipelines. This involves ingesting new metadata into the control tables, which will then be used to build or update the bronze, silver and gold layer pipelines. Additionally, the developer will create transformation notebooks in the Gold layer.
Framework Developer’s responsibilities
- Create and maintain the generic Bronze/Silver/Gold loader notebooks
- Create and maintain modular utility methods to support repetitive tasks in the framework, such as autoloader-based bronze loading, data quality check, implementing SCD etc.
- Create and maintain the control table data model.
- Create and maintain pipelines to populate control tables from YML files, which will be used in the CI-CD pipeline.
Application Developer’s responsibilities
- Create individual YML files for bronze_control, silver_control, dq_rules_assignment, and gold_control. A separate pipeline will ingest all these YML files into the control tables.
- Create transformation notebooks for the Gold layer tables.
Conclusion
In this part, we covered the key factors to consider while designing the ETL framework, including the standard architecture, metadata model (the framework's backbone), and the developer's journey in building and maintaining it. While variations may exist to suit specific project needs, this serves as a reference architecture to guide the creation of a scalable, robust ETL framework.
In part 2, we'll explore orchestration and how it supports the framework-driven ETL architecture, seamlessly integrating with a standard CI/CD pipeline.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.