Databricks Community

AmitDECopilot · a month ago

I’ve been exploring a metadata-driven approach to data engineering through a project called Data Engineering Copilot.

The idea is to treat Source-to-Target Mapping (STTM) documents as structured metadata rather than static documentation.

Instead of manually translating STTM into Spark SQL, data quality checks, documentation, and pipelines, these artifacts could be generated automatically from a Canonical Metadata Model.

The workflow looks something like this:

STTM
↓
Canonical Metadata Model
↓
Spark SQL Generation
↓
Data Quality Rules
↓
Documentation
↓
Production Pipelines
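
To make the first two steps concrete, here is a minimal sketch of how STTM rows might be held in a canonical model and rendered into Spark SQL. It is not the Data Engineering Copilot implementation; the class names (`ColumnMapping`, `TableMapping`), the function `generate_insert_sql`, and the table and column names are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ColumnMapping:
    """One STTM row: how a single target column is derived (hypothetical shape)."""
    target_column: str
    expression: str  # Spark SQL expression over source columns


@dataclass(frozen=True)
class TableMapping:
    """Canonical model for one source-to-target mapping (hypothetical shape)."""
    source_table: str
    target_table: str
    columns: Tuple[ColumnMapping, ...]


def generate_insert_sql(mapping: TableMapping) -> str:
    """Render an INSERT ... SELECT statement from the mapping."""
    select_list = ",\n".join(
        f"  {c.expression} AS {c.target_column}" for c in mapping.columns
    )
    return (
        f"INSERT INTO {mapping.target_table}\n"
        f"SELECT\n{select_list}\n"
        f"FROM {mapping.source_table}"
    )


# Illustrative mapping; these tables and columns are made up.
customer_mapping = TableMapping(
    source_table="raw.crm_customers",
    target_table="silver.customers",
    columns=(
        ColumnMapping("customer_id", "CAST(cust_id AS BIGINT)"),
        ColumnMapping("full_name", "TRIM(CONCAT(first_nm, ' ', last_nm))"),
    ),
)

print(generate_insert_sql(customer_mapping))
```

The generated string could then be reviewed and passed to `spark.sql(...)` in a notebook or job. Keeping the generator as a pure function over the model makes its output easy to diff and test.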

I’m curious:

How are teams managing STTM today?
Are you using metadata-driven frameworks?
Has anyone experimented with generating Databricks assets directly from metadata?

Would love to hear how others are approaching this challenge.

Amit Kumar Singh
Lead Data Engineer | AI-Assisted Data Engineering

rdokala · a month ago

This is a good discussion topic. In my experience, teams today use a mix of metadata-driven approaches and, more commonly, traditional Excel-based STTMs.

A few observations:

How most teams manage STTM today

Level 1 (Most Common)

STTM in Excel, Word, or Confluence.
Engineers manually translate mappings into Spark SQL, dbt, Informatica, ADF, etc.
Documentation becomes stale quickly.
Data quality rules are implemented separately from mappings.

Level 2 (Maturing Teams)

STTM stored in structured tables.
Reusable ETL framework reads metadata for:
- Source tables
- Target tables
- Incremental logic
- Column mappings
- Audit columns
Pipeline orchestration becomes metadata-driven.
Still, transformation logic is often manually coded.
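
As a rough illustration of Level 2, the sketch below builds an incremental `MERGE INTO` statement from one metadata record covering the items listed above. The dictionary keys, the function `build_incremental_merge`, and the table names are assumptions made for this example, not a standard schema. The watermark is inlined as a string literal only to keep the sketch short; a real framework would validate or parameterize it.

```python
def build_incremental_merge(meta: dict, last_watermark: str) -> str:
    """Build a MERGE statement from a metadata record (hypothetical schema)."""
    # Target column -> source expression, with audit columns appended.
    cols = dict(meta["column_mappings"])
    cols.update(meta["audit_columns"])
    keys = meta["key_columns"]

    select_list = ", ".join(f"{expr} AS {name}" for name, expr in cols.items())
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    update_set = ", ".join(f"t.{c} = s.{c}" for c in cols if c not in keys)
    insert_cols = ", ".join(cols)
    insert_vals = ", ".join(f"s.{c}" for c in cols)

    return (
        f"MERGE INTO {meta['target_table']} t\n"
        f"USING (\n"
        f"  SELECT {select_list}\n"
        f"  FROM {meta['source_table']}\n"
        # Incremental logic: only rows changed since the last run.
        f"  WHERE {meta['watermark_column']} > '{last_watermark}'\n"
        f") s\n"
        f"ON {on_clause}\n"
        f"WHEN MATCHED THEN UPDATE SET {update_set}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


# Illustrative metadata row; in practice this would be read from a table.
orders_meta = {
    "source_table": "raw.orders",
    "target_table": "silver.orders",
    "key_columns": ["order_id"],
    "watermark_column": "updated_at",
    "column_mappings": {
        "order_id": "order_id",
        "amount": "CAST(amt AS DECIMAL(18,2))",
    },
    "audit_columns": {"load_ts": "current_timestamp()"},
}

print(build_incremental_merge(orders_meta, "2024-01-01 00:00:00"))
```

Note that the expressions in `column_mappings` are still written by hand, which matches the last point above: the framework handles the plumbing while the transformation logic stays manual.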

Level 3 (Advanced Teams)

Metadata repository acts as the single source of truth.
Code generation produces:
- SQL
- ETL pipelines
- DQ rules
- Documentation
- Lineage
Human review before deployment.
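
A small sketch of the Level 3 idea, where one metadata record feeds several generators. The record layout (`target`, `expression`, `sources`, `not_null`, `description`) and the function names are hypothetical; real repositories will have richer models. The outputs are plain strings and tuples so that a person can review them before anything is deployed.

```python
def generate_dq_rules(target_table, columns):
    """Return one violation-count query per column declared NOT NULL."""
    return [
        f"SELECT COUNT(*) AS violations FROM {target_table} "
        f"WHERE {col['target']} IS NULL"
        for col in columns
        if col.get("not_null")
    ]


def generate_docs(target_table, columns):
    """Render a plain-text data dictionary for the target table."""
    lines = [f"Table: {target_table}"]
    for col in columns:
        lines.append(
            f"- {col['target']}: {col['description']} "
            f"(derived as {col['expression']})"
        )
    return "\n".join(lines)


def generate_lineage(source_table, target_table, columns):
    """Return column-level lineage edges as (source, target) pairs."""
    return [
        (f"{source_table}.{src}", f"{target_table}.{col['target']}")
        for col in columns
        for src in col["sources"]
    ]


# Illustrative metadata; names and descriptions are made up.
columns = [
    {
        "target": "customer_id",
        "expression": "CAST(cust_id AS BIGINT)",
        "sources": ["cust_id"],
        "not_null": True,
        "description": "Unique customer identifier",
    },
    {
        "target": "full_name",
        "expression": "TRIM(CONCAT(first_nm, ' ', last_nm))",
        "sources": ["first_nm", "last_nm"],
        "not_null": False,
        "description": "Customer display name",
    },
]

for rule in generate_dq_rules("silver.customers", columns):
    print(rule)
print(generate_docs("silver.customers", columns))
print(generate_lineage("raw.crm_customers", "silver.customers", columns))
```

Because every artifact is derived from the same record, a change to a mapping shows up in the SQL, the rules, the documentation, and the lineage together, which is what keeps them from drifting apart.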

AmitDECopilot · a month ago

Great breakdown. In my experience, many organizations are currently somewhere between Level 1 and Level 2.

One possible next step could be:

Level 4 – AI-Assisted Metadata Engineering

Business Requirements
↓
STTM
↓
Canonical Metadata Model
↓
AI Validation
↓
SQL
PySpark
DQ Rules
Documentation
Lineage
Knowledge Discovery

The interesting shift is that metadata becomes the primary development artifact. Instead of engineers manually translating specifications into code, AI helps validate, enrich, and generate engineering artifacts from a governed metadata model, while humans remain responsible for final outcomes and deployment decisions.
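
One way to picture the validation step is as a gate in front of code generation. The sketch below is only an assumption about how such a gate could be shaped: `structural_checks` is a deterministic check, `placeholder_ai_reviewer` is a stand-in heuristic marking where a model-based reviewer could be plugged in (no real AI API is called), and `approved_by` represents the human sign-off. All names are hypothetical.

```python
from typing import Callable, List, Optional

# A reviewer takes the metadata model and returns a list of findings.
Reviewer = Callable[[dict], List[str]]


def structural_checks(model: dict) -> List[str]:
    """Deterministic checks that need no AI: duplicates and empty expressions."""
    findings = []
    seen = set()
    for col in model["columns"]:
        name = col["target"]
        if name in seen:
            findings.append(f"duplicate target column: {name}")
        seen.add(name)
        if not col.get("expression", "").strip():
            findings.append(f"missing expression for: {name}")
    return findings


def placeholder_ai_reviewer(model: dict) -> List[str]:
    """Stand-in for an AI reviewer; here it only flags missing descriptions."""
    return [
        f"no description for: {col['target']}"
        for col in model["columns"]
        if not col.get("description")
    ]


def validate(model: dict, reviewers: List[Reviewer],
             approved_by: Optional[str] = None) -> dict:
    """Collect findings; generation is allowed only if clean and human-approved."""
    findings = [f for review in reviewers for f in review(model)]
    return {
        "findings": findings,
        "ready_to_generate": not findings and approved_by is not None,
    }


# Illustrative model with a deliberate duplicate and an empty expression.
draft_model = {
    "columns": [
        {"target": "id", "expression": "cust_id", "description": "Customer key"},
        {"target": "id", "expression": "", "description": "Duplicate by mistake"},
    ]
}

print(validate(draft_model, [structural_checks, placeholder_ai_reviewer]))
```

Even with no findings, `ready_to_generate` stays false until `approved_by` is set, which reflects the point that humans remain responsible for the deployment decision.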

Amit Kumar Singh
Lead Data Engineer | AI-Assisted Data Engineering
