
From STTM to Databricks Pipelines: Can Metadata Become the Source Code of Data Engineering?

A0s01gy
New Contributor II

I’ve been exploring a metadata-driven approach to data engineering through a project called Data Engineering Copilot.

The idea is to treat Source-to-Target Mapping (STTM) documents as structured metadata rather than static documentation.

Instead of manually translating the STTM into Spark SQL, data quality checks, documentation, and pipelines, these artifacts could be generated automatically from a Canonical Metadata Model.

The workflow looks something like this:

STTM → Canonical Metadata Model → Spark SQL Generation → Data Quality Rules → Documentation → Production Pipelines
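To make the first two hops of that workflow concrete, here is a minimal sketch of how STTM rows might be normalized into a small canonical model and rendered as a Spark SQL SELECT. The class names, function names, and the sample tables and columns are all hypothetical, and the sketch assumes one source table per target table, which real mappings often violate.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ColumnMapping:
    """One STTM row: how a single target column is derived."""
    target_column: str
    expression: str  # Spark SQL expression over source columns


@dataclass(frozen=True)
class TableMapping:
    """Canonical form of all STTM rows that feed one target table."""
    source_table: str
    target_table: str
    columns: Tuple[ColumnMapping, ...]


def build_table_mapping(rows: List[Dict[str, str]]) -> TableMapping:
    """Normalize raw STTM rows (e.g. exported from a spreadsheet)."""
    sources = {r["source_table"] for r in rows}
    targets = {r["target_table"] for r in rows}
    if len(sources) != 1 or len(targets) != 1:
        raise ValueError("this sketch handles one source and one target table")
    return TableMapping(
        source_table=sources.pop(),
        target_table=targets.pop(),
        columns=tuple(
            ColumnMapping(r["target_column"], r["expression"]) for r in rows
        ),
    )


def generate_select_sql(mapping: TableMapping) -> str:
    """Render the SELECT that shapes source rows into the target layout."""
    select_list = ",\n".join(
        f"  {c.expression} AS {c.target_column}" for c in mapping.columns
    )
    return f"SELECT\n{select_list}\nFROM {mapping.source_table}"


# Hypothetical STTM rows; table and column names are made up for illustration.
sttm_rows = [
    {"source_table": "raw.customers", "target_table": "silver.customers",
     "target_column": "customer_id", "expression": "CAST(cust_id AS BIGINT)"},
    {"source_table": "raw.customers", "target_table": "silver.customers",
     "target_column": "full_name",
     "expression": "concat_ws(' ', first_name, last_name)"},
]

mapping = build_table_mapping(sttm_rows)
print(generate_select_sql(mapping))
```

The generator only emits the SELECT. How it gets wrapped (insert, merge, or a pipeline table definition) is deliberately left out, since that choice depends on the target platform and load pattern.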

I’m curious:

  1. How are teams managing STTM today?
  2. Are you using metadata-driven frameworks?
  3. Has anyone experimented with generating Databricks assets directly from metadata?

Would love to hear how others are approaching this challenge.

Amit Kumar Singh
Lead Data Engineer | AI-Assisted Data Engineering
1 ACCEPTED SOLUTION

Accepted Solutions

A0s01gy
New Contributor II

Great breakdown. In my experience, many organizations are currently somewhere between Level 1 and Level 2.

One possible next step could be:

Level 4 – AI-Assisted Metadata Engineering

Business Requirements → STTM → Canonical Metadata Model → AI Validation → generated artifacts:

  • SQL
  • PySpark
  • DQ Rules
  • Documentation
  • Lineage
  • Knowledge Discovery

The interesting shift is that metadata becomes the primary development artifact. Instead of engineers manually translating specifications into code, AI helps validate, enrich, and generate engineering artifacts from a governed metadata model, while humans remain responsible for final outcomes and deployment decisions.
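As a rough illustration of the "validate, then let a human decide" part, the sketch below runs plain rule-based checks over mapping rows and only reports the metadata as deployable when the checks pass and a named reviewer has approved. This is a deterministic stand-in, not AI validation; the function names, row fields, and approval mechanism are assumptions made for the example.

```python
from typing import Dict, List


def validate_mappings(rows: List[Dict[str, str]]) -> List[str]:
    """Return human-readable findings; an empty list means no issues found."""
    findings = []
    seen = set()
    for i, row in enumerate(rows, start=1):
        target = row.get("target_column", "").strip()
        expr = row.get("expression", "").strip()
        if not target:
            findings.append(f"row {i}: missing target column")
        elif target in seen:
            findings.append(f"row {i}: duplicate target column '{target}'")
        else:
            seen.add(target)
        if not expr:
            findings.append(f"row {i}: missing expression for '{target}'")
    return findings


def ready_to_deploy(findings: List[str], approved_by: str = "") -> bool:
    """Deployable only when checks pass AND a named person has approved."""
    return not findings and bool(approved_by.strip())


# Hypothetical draft mapping with two deliberate mistakes.
draft_rows = [
    {"target_column": "customer_id", "expression": "CAST(cust_id AS BIGINT)"},
    {"target_column": "customer_id", "expression": "cust_id"},
    {"target_column": "email", "expression": ""},
]

findings = validate_mappings(draft_rows)
for finding in findings:
    print(finding)
print("deployable:", ready_to_deploy(findings, approved_by="reviewer"))
```

An AI-assisted step could add findings to the same list (for example, flagging an expression that looks inconsistent with the column description), while the approval gate stays with a person.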

Amit Kumar Singh
Lead Data Engineer | AI-Assisted Data Engineering


2 REPLIES

rdokala
New Contributor III

This is a good discussion topic. In my experience, teams today use a mix of metadata-driven frameworks and, more commonly, traditional Excel-based STTMs.

A few observations:

How most teams manage STTM today

Level 1 (Most Common)

  • STTM in Excel, Word, or Confluence.
  • Engineers manually translate mappings into Spark SQL, dbt, Informatica, ADF, etc.
  • Documentation becomes stale quickly.
  • Data quality rules are implemented separately from mappings.

Level 2 (Maturing Teams)

  • STTM stored in structured tables.
  • Reusable ETL framework reads metadata for:
    • Source tables
    • Target tables
    • Incremental logic
    • Column mappings
    • Audit columns
  • Pipeline orchestration becomes metadata-driven.
  • Still, transformation logic is often manually coded.
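A Level 2 setup along these lines could be sketched as a config record per pipeline plus one generic builder. The field names, the default audit column, and the sample table are hypothetical, and the watermark value is inlined into the SQL only for readability (it is assumed to come from the framework's own state, not from user input).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PipelineConfig:
    """One row of a hypothetical pipeline metadata table."""
    source_table: str
    target_table: str
    column_mappings: Dict[str, str]          # target column -> source expression
    watermark_column: Optional[str] = None   # None means full load
    audit_columns: Dict[str, str] = field(
        default_factory=lambda: {"load_ts": "current_timestamp()"}
    )


def build_select(cfg: PipelineConfig, last_watermark: Optional[str] = None) -> str:
    """Build the extract query: mapped columns, audit columns, optional filter."""
    exprs = {**cfg.column_mappings, **cfg.audit_columns}
    select_list = ",\n".join(f"  {expr} AS {col}" for col, expr in exprs.items())
    sql = f"SELECT\n{select_list}\nFROM {cfg.source_table}"
    # Incremental logic: only rows newer than the last processed watermark.
    if cfg.watermark_column and last_watermark is not None:
        sql += f"\nWHERE {cfg.watermark_column} > '{last_watermark}'"
    return sql


# Hypothetical metadata entry; names are illustrative only.
orders_cfg = PipelineConfig(
    source_table="raw.orders",
    target_table="silver.orders",
    column_mappings={
        "order_id": "order_id",
        "amount": "CAST(amt AS DECIMAL(18, 2))",
    },
    watermark_column="updated_at",
)

print(build_select(orders_cfg, last_watermark="2024-01-01 00:00:00"))
```

The same builder serves every table in the metadata store, which is why orchestration becomes metadata-driven at this level. Anything beyond a simple expression per column still has to be hand-coded, matching the limitation noted in the last bullet.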

Level 3 (Advanced Teams)

  • Metadata repository acts as the single source of truth.
  • Code generation produces:
    • SQL
    • ETL pipelines
    • DQ rules
    • Documentation
    • Lineage
  • Human review before deployment.
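For a flavor of Level 3 generation, this sketch derives data quality checks and a plain-text documentation snippet from the same column metadata. The metadata fields (`nullable`, `unique`, `description`), function names, and sample table are assumptions for illustration, not a standard schema.

```python
from typing import Any, Dict, List


def generate_dq_checks(table: str, columns: List[Dict[str, Any]]) -> Dict[str, str]:
    """Map check name -> SQL that counts violating rows (0 means the check passes)."""
    checks = {}
    for col in columns:
        name = col["name"]
        if not col.get("nullable", True):
            checks[f"{name}_not_null"] = (
                f"SELECT COUNT(*) AS violations FROM {table} WHERE {name} IS NULL"
            )
        if col.get("unique", False):
            checks[f"{name}_unique"] = (
                f"SELECT COUNT(*) AS violations FROM ("
                f"SELECT {name} FROM {table} GROUP BY {name} HAVING COUNT(*) > 1"
                f") AS dupes"
            )
    return checks


def generate_docs(table: str, columns: List[Dict[str, Any]]) -> str:
    """Render a short plain-text data dictionary for the target table."""
    lines = [f"Table: {table}"]
    for col in columns:
        lines.append(f"  {col['name']}: {col.get('description', '(no description)')}")
    return "\n".join(lines)


# Hypothetical column metadata for one target table.
customer_columns = [
    {"name": "customer_id", "nullable": False, "unique": True,
     "description": "Surrogate key for the customer."},
    {"name": "email", "nullable": True,
     "description": "Primary contact email."},
]

for check_name, check_sql in generate_dq_checks("silver.customers", customer_columns).items():
    print(check_name, "->", check_sql)
print(generate_docs("silver.customers", customer_columns))
```

Because the rules and the documentation come from the same records as the mappings, they cannot drift apart the way separately maintained artifacts do at Level 1. The generated output would still go through the human review step before deployment.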
