Sunday
I’ve been exploring a metadata-driven approach to data engineering through a project called Data Engineering Copilot.
The idea is to treat Source-to-Target Mapping (STTM) documents as structured metadata rather than static documentation.
Instead of manually translating STTM into Spark SQL, data quality checks, documentation, and pipelines, a Canonical Metadata Model could generate these artifacts automatically.
The workflow looks something like this:
STTM
↓
Canonical Metadata Model
↓
Spark SQL Generation
↓
Data Quality Rules
↓
Documentation
↓
Production Pipelines
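As a rough illustration of the first two steps, here is a minimal Python sketch of a canonical model that renders Spark SQL. The class names (`ColumnMapping`, `TableMapping`), the function `generate_insert_sql`, and the table and column names are all hypothetical, invented for this example rather than taken from the project.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ColumnMapping:
    """One STTM row: how a target column is derived from the source."""
    target_column: str
    source_expression: str  # a Spark SQL expression over the source table
    nullable: bool = True


@dataclass(frozen=True)
class TableMapping:
    """Hypothetical canonical model for one source-to-target mapping."""
    source_table: str
    target_table: str
    columns: Tuple[ColumnMapping, ...]


def generate_insert_sql(mapping: TableMapping) -> str:
    """Render an INSERT ... SELECT statement from the canonical model."""
    select_list = ",\n".join(
        f"  {c.source_expression} AS {c.target_column}" for c in mapping.columns
    )
    return (
        f"INSERT INTO {mapping.target_table}\n"
        f"SELECT\n{select_list}\n"
        f"FROM {mapping.source_table}"
    )


# Example mapping; table and column names are made up for illustration.
customer_mapping = TableMapping(
    source_table="raw.customers",
    target_table="silver.dim_customer",
    columns=(
        ColumnMapping("customer_id", "CAST(cust_id AS BIGINT)", nullable=False),
        ColumnMapping("full_name", "TRIM(CONCAT(first_nm, ' ', last_nm))"),
    ),
)

print(generate_insert_sql(customer_mapping))
```

The point of the sketch is that the SQL is a pure function of the metadata, so changing the STTM and regenerating replaces hand-editing the query.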
I’m curious:
- How are teams managing STTM today?
- Are you using metadata-driven frameworks?
- Has anyone experimented with generating Databricks assets directly from metadata?
Would love to hear how others are approaching this challenge.
Lead Data Engineer | AI-Assisted Data Engineering
Tuesday
This is a good discussion topic. In my experience, teams today use a mix of both: some metadata-driven frameworks, but mostly traditional Excel-based STTMs.
A few observations:
How most teams manage STTM today
Level 1 (Most Common)
- STTM in Excel, Word, or Confluence.
- Engineers manually translate mappings into Spark SQL, dbt, Informatica, ADF, etc.
- Documentation becomes stale quickly.
- Data quality rules are implemented separately from mappings.
Level 2 (Maturing Teams)
- STTM stored in structured tables.
- Reusable ETL framework reads metadata for:
  - Source tables
  - Target tables
  - Incremental logic
  - Column mappings
  - Audit columns
- Pipeline orchestration becomes metadata-driven.
- Still, transformation logic is often manually coded.
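A Level 2 framework of this kind might look roughly like the sketch below, where one metadata record supplies the source table, target table, incremental (watermark) logic, column mappings, and audit columns. The dictionary keys and the function `build_incremental_insert` are assumptions made for this example, not an existing framework's schema.

```python
def build_incremental_insert(meta: dict, last_watermark: str) -> str:
    """Build an INSERT ... SELECT for one table from its metadata record.

    The watermark value is interpolated as a literal to keep the sketch short;
    a real framework would more likely bind it as a query parameter.
    """
    # Column mappings: target column name -> source expression.
    select_items = [f"{src} AS {tgt}" for tgt, src in meta["column_mappings"].items()]
    # Audit columns: column name -> expression evaluated at load time.
    select_items += [f"{expr} AS {name}" for name, expr in meta["audit_columns"].items()]

    sql = (
        f"INSERT INTO {meta['target_table']} "
        f"SELECT {', '.join(select_items)} "
        f"FROM {meta['source_table']}"
    )

    # Incremental logic: only add a filter when a watermark column is configured.
    watermark_column = meta.get("watermark_column")
    if watermark_column:
        sql += f" WHERE {watermark_column} > '{last_watermark}'"
    return sql


# Hypothetical metadata row, as it might be read from a structured STTM table.
orders_meta = {
    "source_table": "raw.orders",
    "target_table": "silver.orders",
    "watermark_column": "updated_at",
    "column_mappings": {
        "order_id": "id",
        "order_total": "CAST(amount AS DECIMAL(18,2))",
    },
    "audit_columns": {"loaded_at": "current_timestamp()"},
}

print(build_incremental_insert(orders_meta, "2024-01-01 00:00:00"))
```

Note that the source expressions are still written by hand inside the metadata, which matches the observation that transformation logic often stays manual at this level.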
Level 3 (Advanced Teams)
- Metadata repository acts as the single source of truth.
- Code generation produces:
  - SQL
  - ETL pipelines
  - DQ rules
  - Documentation
  - Lineage
- Human review before deployment.
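To make the Level 3 idea concrete, the sketch below derives two artifacts (DQ checks and a documentation table) from the same column metadata. The metadata shape and the functions `generate_dq_checks` and `generate_doc_table` are hypothetical; a real repository would carry far more detail.

```python
from typing import Dict, List

# Hypothetical column-level metadata for one target table.
columns: List[Dict] = [
    {"target": "customer_id", "source": "cust_id", "nullable": False,
     "description": "Business key"},
    {"target": "email", "source": "LOWER(email_addr)", "nullable": True,
     "description": "Contact email"},
]


def generate_dq_checks(table: str, cols: List[Dict]) -> List[str]:
    """One null-count query per non-nullable column; any count above 0 is a failure."""
    return [
        f"SELECT COUNT(*) AS failures FROM {table} WHERE {c['target']} IS NULL"
        for c in cols
        if not c["nullable"]
    ]


def generate_doc_table(table: str, cols: List[Dict]) -> str:
    """Render a pipe-delimited documentation table from the same metadata."""
    lines = [
        f"Table: {table}",
        "| Column | Source expression | Nullable | Description |",
        "|---|---|---|---|",
    ]
    for c in cols:
        nullable = "yes" if c["nullable"] else "no"
        lines.append(
            f"| {c['target']} | {c['source']} | {nullable} | {c['description']} |"
        )
    return "\n".join(lines)


for check in generate_dq_checks("silver.dim_customer", columns):
    print(check)
print(generate_doc_table("silver.dim_customer", columns))
```

Because both outputs come from one record, a mapping change updates the checks and the documentation together, which is what keeps them from going stale.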
Wednesday
Great breakdown. In my experience, many organizations are currently somewhere between Level 1 and Level 2.
One possible next step could be:
Level 4 – AI-Assisted Metadata Engineering
Business Requirements
↓
STTM
↓
Canonical Metadata Model
↓
AI Validation
↓
SQL
PySpark
DQ Rules
Documentation
Lineage
Knowledge Discovery
The interesting shift is that metadata becomes the primary development artifact. Instead of engineers manually translating specifications into code, AI helps validate, enrich, and generate engineering artifacts from a governed metadata model, while humans remain responsible for final outcomes and deployment decisions.
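One way to picture the validation step and the human sign-off is the sketch below. It uses plain rule-based validators as a stand-in for the AI component, since no specific model or API is named here; an AI-backed check could be plugged in as one more function with the same signature. All names (`check_required_fields`, `check_duplicate_targets`, `review`) and the mapping shape are assumptions for illustration.

```python
from typing import Callable, Dict, List, Optional

# A validator takes a mapping and returns a list of human-readable findings.
Validator = Callable[[Dict], List[str]]


def check_required_fields(mapping: Dict) -> List[str]:
    """Flag missing or empty top-level fields."""
    return [
        f"missing required field: {field}"
        for field in ("source_table", "target_table", "columns")
        if not mapping.get(field)
    ]


def check_duplicate_targets(mapping: Dict) -> List[str]:
    """Flag target columns that are mapped more than once."""
    seen, findings = set(), []
    for col in mapping.get("columns") or []:
        name = col["target"]
        if name in seen:
            findings.append(f"duplicate target column: {name}")
        seen.add(name)
    return findings


def review(mapping: Dict, validators: List[Validator],
           approved_by: Optional[str] = None) -> Dict:
    """Run all validators; deployable only with no findings AND a named approver."""
    findings = [f for validate in validators for f in validate(mapping)]
    return {
        "findings": findings,
        "approved_by": approved_by,
        "deployable": not findings and approved_by is not None,
    }


validators = [check_required_fields, check_duplicate_targets]

# Hypothetical draft mapping with a deliberate duplicate.
draft = {
    "source_table": "raw.orders",
    "target_table": "silver.orders",
    "columns": [
        {"target": "id", "source": "order_id"},
        {"target": "id", "source": "legacy_id"},
    ],
}
print(review(draft, validators))
```

The design choice worth noting is that validation alone never makes a mapping deployable: a clean result still needs an explicit approver, which keeps the deployment decision with a person.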
Lead Data Engineer | AI-Assisted Data Engineering