cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Building a Metadata-Driven ETL Framework on Databricks

A0s01gy
New Contributor III

 

Stop Writing ETL Code for Every New Pipeline

As organizations modernize their data platforms, one challenge continues to appear repeatedly:

Every new source requires another ETL job.

Every new business rule requires another SQL update.

Every schema change requires another deployment.

Over time, pipelines become difficult to maintain because business logic is scattered across notebooks, SQL scripts, stored procedures, and configuration files.

A metadata-driven framework addresses this problem by separating configuration from execution.

Instead of writing custom ETL code for every dataset, the framework reads metadata and dynamically generates the required transformations.

High-Level Architecture

Business Source Systems

          โ”‚

          โ–ผ

Source Configuration

          โ”‚

          โ–ผ

Intermediate Dataset

          โ”‚

          โ–ผ

Transformation Metadata

          โ”‚

          โ–ผ

Dynamic SQL Generation

          โ”‚

          โ–ผ

Databricks Execution

          โ”‚

          โ–ผ

Business Fact Tables

          โ”‚

          โ–ผ

Monitoring & Audit

Each layer has a single responsibility, making the pipeline easier to understand, maintain, and extend.

 

Step 1 โ€“ Source Configuration

The first layer simply defines where the data comes from.

Example:

datasource_identifier: src_transaction_fact

 

mapping_file: transaction_fact_mapping.yaml

 

query: |

    SELECT *

    FROM transaction

No business logic exists here.

Its only responsibility is to expose the source data.

 

Step 2 โ€“ Build an Intermediate Dataset

Instead of applying every business rule directly on the source, the framework creates an intermediate dataset.

This layer:

  • Standardizes column names
  • Renames fields
  • Adds placeholder columns
  • Applies default values
  • Creates a normalized structure

Example:

SELECT

 

pol_id,

 

claim_id,

 

location_id,

 

CAST(NULL AS DECIMAL(38,8)) AS profit_amount_mon,

 

CAST(NULL AS DECIMAL(38,8)) AS profit_amount_ytd,

 

 

 

FROM transaction

This intermediate layer becomes the foundation for all downstream transformations.

 

Step 3 โ€“ Store Business Logic as Metadata

Instead of embedding calculations directly in notebooks, describe them declaratively.

Example:

column: profit_amount_ytd

 

aggregation: SUM

 

window:

 

  partition_by:

 

    - policy_id

 

    - location_id

 

  order_by:

 

    - month

This metadata describes what needs to happen rather than how to implement it.

 

Step 4 โ€“ Generate SQL Dynamically

The framework converts metadata into executable SQL.

Instead of manually writing dozens of window functions, SQL is generated automatically.

Generated example:

SUM(profit_amount)

 

OVER(

 

PARTITION BY policy_id,

 

location_id

 

ORDER BY month

 

ROWS UNBOUNDED PRECEDING

 

)

This approach significantly reduces handwritten SQL.

 

Step 5 โ€“ Reference Data and Business Rules

Many pipelines rely on lookup values such as:

  • Currency mappings
  • Product mappings
  • Reporting categories
  • Region mappings
  • Business classifications

Rather than hardcoding CASE statements, these rules can be maintained in configurable reference tables.

Benefits include:

  • Easier maintenance
  • Business-owned mappings
  • Reduced code changes
  • Improved governance

 

Step 6 โ€“ Execute on Databricks

Once metadata has been interpreted, Databricks executes the generated SQL.

Because execution is metadata-driven:

  • New datasets require little or no new code
  • New business rules are configuration changes
  • Schema evolution becomes easier to manage
  • Pipelines become more reusable

 

Step 7 โ€“ Monitoring and Observability

A production framework should capture operational metadata for every execution.

Typical metrics include:

  • Run ID
  • Start Time
  • End Time
  • Pipeline Status
  • Failed Component
  • Row Counts
  • Execution Duration
  • Error Messages

This information helps engineering teams troubleshoot failures quickly.

 

Debugging Strategy

One of the biggest advantages of a layered architecture is easier debugging.

Instead of searching through thousands of lines of SQL, engineers can isolate the issue step by step.

Incorrect Output

 

โ†“

 

Generated SQL

 

โ†“

 

Transformation Metadata

 

โ†“

 

Intermediate Dataset

 

โ†“

 

Source Configuration

 

โ†“

 

Source Tables

Each layer can be validated independently.

 

Why This Design Works

Separating metadata from execution offers several advantages:

  • Less handwritten SQL
  • Easier onboarding for new engineers
  • Centralized business logic
  • Better governance
  • Improved maintainability
  • Faster debugging
  • Easier schema evolution
  • Reusable pipeline components

As data platforms continue to grow, metadata-driven architectures help engineering teams build pipelines that are easier to scale, easier to maintain, and more resilient to change.

Rather than treating every ETL job as a custom implementation, the framework turns pipeline development into a configuration-driven processโ€”allowing engineers to focus on business logic instead of repetitive coding.

Amit Kumar Singh
Lead Data Engineer | AI-Assisted Data Engineering
0 REPLIES 0