Building a Metadata-Driven ETL Framework on Databricks

AmitDECopilot
New Contributor III

 

Stop Writing ETL Code for Every New Pipeline

As organizations modernize their data platforms, one challenge continues to appear repeatedly:

Every new source requires another ETL job.

Every new business rule requires another SQL update.

Every schema change requires another deployment.

Over time, pipelines become difficult to maintain because business logic is scattered across notebooks, SQL scripts, stored procedures, and configuration files.

A metadata-driven framework addresses this problem by separating configuration from execution.

Instead of writing custom ETL code for every dataset, the framework reads metadata and dynamically generates the required transformations.

High-Level Architecture

Business Source Systems

         

         

Source Configuration

         

         

Intermediate Dataset

         

         

Transformation Metadata

         

         

Dynamic SQL Generation

         

         

Databricks Execution

         

         

Business Fact Tables

         

         

Monitoring & Audit

Each layer has a single responsibility, making the pipeline easier to understand, maintain, and extend.

 

Step 1 – Source Configuration

The first layer simply defines where the data comes from.

Example:

datasource_identifier: src_transaction_fact

 

mapping_file: transaction_fact_mapping.yaml

 

query: |

    SELECT *

    FROM transaction

No business logic exists here.

Its only responsibility is to expose the source data.

 

Step 2 – Build an Intermediate Dataset

Instead of applying every business rule directly on the source, the framework creates an intermediate dataset.

This layer:

  • Standardizes column names
  • Renames fields
  • Adds placeholder columns
  • Applies default values
  • Creates a normalized structure

Example:

SELECT

 

pol_id,

 

claim_id,

 

location_id,

 

CAST(NULL AS DECIMAL(38,8)) AS profit_amount_mon,

 

CAST(NULL AS DECIMAL(38,8)) AS profit_amount_ytd,

 

 

 

FROM transaction

This intermediate layer becomes the foundation for all downstream transformations.

 

Step 3 – Store Business Logic as Metadata

Instead of embedding calculations directly in notebooks, describe them declaratively.

Example:

column: profit_amount_ytd

 

aggregation: SUM

 

window:

 

  partition_by:

 

    - policy_id

 

    - location_id

 

  order_by:

 

    - month

This metadata describes what needs to happen rather than how to implement it.

 

Step 4 – Generate SQL Dynamically

The framework converts metadata into executable SQL.

Instead of manually writing dozens of window functions, SQL is generated automatically.

Generated example:

SUM(profit_amount)

 

OVER(

 

PARTITION BY policy_id,

 

location_id

 

ORDER BY month

 

ROWS UNBOUNDED PRECEDING

 

)

This approach significantly reduces handwritten SQL.

 

Step 5 – Reference Data and Business Rules

Many pipelines rely on lookup values such as:

  • Currency mappings
  • Product mappings
  • Reporting categories
  • Region mappings
  • Business classifications

Rather than hardcoding CASE statements, these rules can be maintained in configurable reference tables.

Benefits include:

  • Easier maintenance
  • Business-owned mappings
  • Reduced code changes
  • Improved governance

 

Step 6 – Execute on Databricks

Once metadata has been interpreted, Databricks executes the generated SQL.

Because execution is metadata-driven:

  • New datasets require little or no new code
  • New business rules are configuration changes
  • Schema evolution becomes easier to manage
  • Pipelines become more reusable

 

Step 7 – Monitoring and Observability

A production framework should capture operational metadata for every execution.

Typical metrics include:

  • Run ID
  • Start Time
  • End Time
  • Pipeline Status
  • Failed Component
  • Row Counts
  • Execution Duration
  • Error Messages

This information helps engineering teams troubleshoot failures quickly.

 

Debugging Strategy

One of the biggest advantages of a layered architecture is easier debugging.

Instead of searching through thousands of lines of SQL, engineers can isolate the issue step by step.

Incorrect Output

 

 

Generated SQL

 

 

Transformation Metadata

 

 

Intermediate Dataset

 

 

Source Configuration

 

 

Source Tables

Each layer can be validated independently.

 

Why This Design Works

Separating metadata from execution offers several advantages:

  • Less handwritten SQL
  • Easier onboarding for new engineers
  • Centralized business logic
  • Better governance
  • Improved maintainability
  • Faster debugging
  • Easier schema evolution
  • Reusable pipeline components

As data platforms continue to grow, metadata-driven architectures help engineering teams build pipelines that are easier to scale, easier to maintain, and more resilient to change.

Rather than treating every ETL job as a custom implementation, the framework turns pipeline development into a configuration-driven process—allowing engineers to focus on business logic instead of repetitive coding.

Amit Kumar Singh
Lead Data Engineer | AI-Assisted Data Engineering