Stop Writing ETL Code for Every New Pipeline
As organizations modernize their data platforms, one challenge continues to appear repeatedly:
Every new source requires another ETL job.
Every new business rule requires another SQL update.
Every schema change requires another deployment.
Over time, pipelines become difficult to maintain because business logic is scattered across notebooks, SQL scripts, stored procedures, and configuration files.
A metadata-driven framework addresses this problem by separating configuration from execution.
Instead of writing custom ETL code for every dataset, the framework reads metadata and dynamically generates the required transformations.
High-Level Architecture
Business Source Systems
โ
โผ
Source Configuration
โ
โผ
Intermediate Dataset
โ
โผ
Transformation Metadata
โ
โผ
Dynamic SQL Generation
โ
โผ
Databricks Execution
โ
โผ
Business Fact Tables
โ
โผ
Monitoring & Audit
Each layer has a single responsibility, making the pipeline easier to understand, maintain, and extend.
Step 1 โ Source Configuration
The first layer simply defines where the data comes from.
Example:
datasource_identifier: src_transaction_fact
mapping_file: transaction_fact_mapping.yaml
query: |
SELECT *
FROM transaction
No business logic exists here.
Its only responsibility is to expose the source data.
Step 2 โ Build an Intermediate Dataset
Instead of applying every business rule directly on the source, the framework creates an intermediate dataset.
This layer:
- Standardizes column names
- Renames fields
- Adds placeholder columns
- Applies default values
- Creates a normalized structure
Example:
SELECT
pol_id,
claim_id,
location_id,
CAST(NULL AS DECIMAL(38,8)) AS profit_amount_mon,
CAST(NULL AS DECIMAL(38,8)) AS profit_amount_ytd,
FROM transaction
This intermediate layer becomes the foundation for all downstream transformations.
Step 3 โ Store Business Logic as Metadata
Instead of embedding calculations directly in notebooks, describe them declaratively.
Example:
column: profit_amount_ytd
aggregation: SUM
window:
partition_by:
- policy_id
- location_id
order_by:
- month
This metadata describes what needs to happen rather than how to implement it.
Step 4 โ Generate SQL Dynamically
The framework converts metadata into executable SQL.
Instead of manually writing dozens of window functions, SQL is generated automatically.
Generated example:
SUM(profit_amount)
OVER(
PARTITION BY policy_id,
location_id
ORDER BY month
ROWS UNBOUNDED PRECEDING
)
This approach significantly reduces handwritten SQL.
Step 5 โ Reference Data and Business Rules
Many pipelines rely on lookup values such as:
- Currency mappings
- Product mappings
- Reporting categories
- Region mappings
- Business classifications
Rather than hardcoding CASE statements, these rules can be maintained in configurable reference tables.
Benefits include:
- Easier maintenance
- Business-owned mappings
- Reduced code changes
- Improved governance
Step 6 โ Execute on Databricks
Once metadata has been interpreted, Databricks executes the generated SQL.
Because execution is metadata-driven:
- New datasets require little or no new code
- New business rules are configuration changes
- Schema evolution becomes easier to manage
- Pipelines become more reusable
Step 7 โ Monitoring and Observability
A production framework should capture operational metadata for every execution.
Typical metrics include:
- Run ID
- Start Time
- End Time
- Pipeline Status
- Failed Component
- Row Counts
- Execution Duration
- Error Messages
This information helps engineering teams troubleshoot failures quickly.
Debugging Strategy
One of the biggest advantages of a layered architecture is easier debugging.
Instead of searching through thousands of lines of SQL, engineers can isolate the issue step by step.
Incorrect Output
โ
Generated SQL
โ
Transformation Metadata
โ
Intermediate Dataset
โ
Source Configuration
โ
Source Tables
Each layer can be validated independently.
Why This Design Works
Separating metadata from execution offers several advantages:
- Less handwritten SQL
- Easier onboarding for new engineers
- Centralized business logic
- Better governance
- Improved maintainability
- Faster debugging
- Easier schema evolution
- Reusable pipeline components
As data platforms continue to grow, metadata-driven architectures help engineering teams build pipelines that are easier to scale, easier to maintain, and more resilient to change.
Rather than treating every ETL job as a custom implementation, the framework turns pipeline development into a configuration-driven processโallowing engineers to focus on business logic instead of repetitive coding.
Amit Kumar Singh
Lead Data Engineer | AI-Assisted Data Engineering