Building a Metadata-Driven ETL Framework on Databr...

AmitDECopilot · Thursday

Stop Writing ETL Code for Every New Pipeline

As organizations modernize their data platforms, one challenge continues to appear repeatedly:

Every new source requires another ETL job.

Every new business rule requires another SQL update.

Every schema change requires another deployment.

Over time, pipelines become difficult to maintain because business logic is scattered across notebooks, SQL scripts, stored procedures, and configuration files.

A metadata-driven framework addresses this problem by separating configuration from execution.

Instead of writing custom ETL code for every dataset, the framework reads metadata and dynamically generates the required transformations.

High-Level Architecture

Business Source Systems

│

▼

Source Configuration

│

▼

Intermediate Dataset

│

▼

Transformation Metadata

│

▼

Dynamic SQL Generation

│

▼

Databricks Execution

│

▼

Business Fact Tables

│

▼

Monitoring & Audit

Each layer has a single responsibility, making the pipeline easier to understand, maintain, and extend.

Step 1 – Source Configuration

The first layer simply defines where the data comes from.

Example:

datasource_identifier: src_transaction_fact

mapping_file: transaction_fact_mapping.yaml

query: |

SELECT *

FROM transaction

No business logic exists here.

Its only responsibility is to expose the source data.

Step 2 – Build an Intermediate Dataset

Instead of applying every business rule directly on the source, the framework creates an intermediate dataset.

This layer:

Standardizes column names
Renames fields
Adds placeholder columns
Applies default values
Creates a normalized structure

Example:

SELECT

pol_id,

claim_id,

location_id,

CAST(NULL AS DECIMAL(38,8)) AS profit_amount_mon,

CAST(NULL AS DECIMAL(38,8)) AS profit_amount_ytd,

FROM transaction

This intermediate layer becomes the foundation for all downstream transformations.

Step 3 – Store Business Logic as Metadata

Instead of embedding calculations directly in notebooks, describe them declaratively.

Example:

column: profit_amount_ytd

aggregation: SUM

window:

partition_by:

- policy_id

- location_id

order_by:

- month

This metadata describes what needs to happen rather than how to implement it.

Step 4 – Generate SQL Dynamically

The framework converts metadata into executable SQL.

Instead of manually writing dozens of window functions, SQL is generated automatically.

Generated example:

SUM(profit_amount)

OVER(

PARTITION BY policy_id,

location_id

ORDER BY month

ROWS UNBOUNDED PRECEDING

)

This approach significantly reduces handwritten SQL.

Step 5 – Reference Data and Business Rules

Many pipelines rely on lookup values such as:

Currency mappings
Product mappings
Reporting categories
Region mappings
Business classifications

Rather than hardcoding CASE statements, these rules can be maintained in configurable reference tables.

Benefits include:

Easier maintenance
Business-owned mappings
Reduced code changes
Improved governance

Step 6 – Execute on Databricks

Once metadata has been interpreted, Databricks executes the generated SQL.

Because execution is metadata-driven:

New datasets require little or no new code
New business rules are configuration changes
Schema evolution becomes easier to manage
Pipelines become more reusable

Step 7 – Monitoring and Observability

A production framework should capture operational metadata for every execution.

Typical metrics include:

Run ID
Start Time
End Time
Pipeline Status
Failed Component
Row Counts
Execution Duration
Error Messages

This information helps engineering teams troubleshoot failures quickly.

Debugging Strategy

One of the biggest advantages of a layered architecture is easier debugging.

Instead of searching through thousands of lines of SQL, engineers can isolate the issue step by step.

Incorrect Output

↓

Generated SQL

↓

Transformation Metadata

↓

Intermediate Dataset

↓

Source Configuration

↓

Source Tables

Each layer can be validated independently.

Why This Design Works

Separating metadata from execution offers several advantages:

Less handwritten SQL
Easier onboarding for new engineers
Centralized business logic
Better governance
Improved maintainability
Faster debugging
Easier schema evolution
Reusable pipeline components

As data platforms continue to grow, metadata-driven architectures help engineering teams build pipelines that are easier to scale, easier to maintain, and more resilient to change.

Rather than treating every ETL job as a custom implementation, the framework turns pipeline development into a configuration-driven process—allowing engineers to focus on business logic instead of repetitive coding.

Amit Kumar Singh
Lead Data Engineer | AI-Assisted Data Engineering

Building a Metadata-Driven ETL Framework on Databricks