Administration & Architecture
Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.

A single DLT for Ingest - feedback on this architecture

RDE305
New Contributor II

What are your thoughts on this Databricks pipeline design?

Different facilities will send me backups of a proprietary transactional database containing tens of thousands of tables. Each facility may differ in how these tables are populated or in their schemas.

  • Pre-Bronze: loop through the parquet files and create a registry of tables and schemas.

  • Bronze: one large ingest DLT with schema inference and schema evolution, parameterized by ADLS2 location. This table could end up being huge if there are tens of thousands of source tables, but it will capture every column regardless of schema. If table_a has col_a, col_b; table_b has col_a, col_c; and table_c has col_b, col_c, the resulting ingest would have the structure table_name, source_file, facility, col_a, col_b, col_c, ingest_timestamp.

  • Silver: separate DLTs per facility + table, so each can evolve its own cleaning logic. Ideally I would extract rows from the ingest and apply the maximal schema from the schema registry, plus cleaning logic and expectations, to individual silver tables (e.g., SILVER_NYC_SECURITIES_TRANSACTIONS, SILVER_LON_SECURITIES_TRANSACTIONS).

  • Gold: unified layer that combines all facilities for enterprise‑wide analytics (e.g., GOLD_SECURITIES_TRANSACTIONS).
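The Pre-Bronze registry and the Bronze superset schema from the example above can be sketched in plain Python. This is only an illustration of the column-union logic: all names are hypothetical, and in practice the schemas would be read from the parquet file footers by Spark or pyarrow, not hard-coded.

```python
# Hypothetical sketch of the Pre-Bronze registry and the Bronze superset schema.
# In a real pipeline the (table, columns) pairs would come from scanning the
# parquet files landed in ADLS2, not from a literal list.

METADATA_COLS = ["table_name", "source_file", "facility"]

def build_registry(table_schemas):
    """Merge observed per-facility schemas into one registry: table -> column set."""
    registry = {}
    for table, cols in table_schemas:
        registry.setdefault(table, set()).update(cols)
    return registry

def bronze_superset(registry):
    """Union of every table's columns, framed by the ingest metadata columns."""
    data_cols = sorted(set().union(*registry.values()))
    return METADATA_COLS + data_cols + ["ingest_timestamp"]

registry = build_registry([
    ("table_a", ["col_a", "col_b"]),
    ("table_b", ["col_a", "col_c"]),
    ("table_c", ["col_b", "col_c"]),
])
print(bronze_superset(registry))
# ['table_name', 'source_file', 'facility', 'col_a', 'col_b', 'col_c', 'ingest_timestamp']
```

This reproduces the structure described in the Bronze bullet: one wide table whose data columns are the superset of every source table's columns.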

Do you see this as a scalable, governed approach, or would you recommend a different pattern for balancing modularity, lineage clarity, and long‑term maintainability?
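For reference, the Silver step described above ("extract rows from the ingest and apply the maximal schema") could look like this stand-alone sketch. Again the names are hypothetical, and a real implementation would be a per-table DLT query over the Bronze table rather than Python dicts.

```python
# Hypothetical sketch of Silver extraction: pull one facility+table's rows out
# of the wide Bronze superset and project them onto that table's maximal
# schema from the Pre-Bronze registry, dropping the unused superset columns.

def silver_extract(bronze_rows, facility, table, registry):
    """Filter Bronze rows for one facility+table, keeping only registered columns."""
    cols = registry[table]
    return [
        {c: row.get(c) for c in cols}
        for row in bronze_rows
        if row["facility"] == facility and row["table_name"] == table
    ]

registry = {"securities_transactions": ["col_a", "col_b"]}
bronze_rows = [
    {"table_name": "securities_transactions", "facility": "NYC",
     "col_a": 1, "col_b": 2, "col_c": None},
    {"table_name": "securities_transactions", "facility": "LON",
     "col_a": 3, "col_b": 4, "col_c": None},
]
nyc = silver_extract(bronze_rows, "NYC", "securities_transactions", registry)
print(nyc)  # [{'col_a': 1, 'col_b': 2}]
```

Each facility+table pair would get its own such query (e.g., one feeding SILVER_NYC_SECURITIES_TRANSACTIONS), which is where the per-facility cleaning logic and expectations would attach.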

0 REPLIES