Administration & Architecture

A single DLT for Ingest - feedback on this architecture

RDE305
New Contributor II

What are your thoughts on this Databricks pipeline design?

Different facilities will send me backups of a proprietary transactional database containing tens of thousands of tables. Each facility may differ in how these tables are populated and in their schemas.

  • Pre-Bronze: loop through the Parquet files and create a registry of tables and schemas (see the registry sketch after this list).

  • Bronze: one large ingest DLT with schema inference and schema evolution, parameterized by ADLS2 location. This could end up being huge if there are tens of thousands of tables, but it will ingest all columns regardless of schema. If table_a has col_a, col_b; table_b has col_a, col_c; and table_c has col_b, col_c, the resulting ingest would have the structure table_name, source_file, facility, col_a, col_b, col_c, ingest_timestamp (a parameterized sketch follows this list).

  • Silver: separate DLTs per facility + table, so each can evolve its own cleaning logic. Ideally I would extract rows from the ingest and apply the maximal schema from the schema registry, plus cleaning logic and expectations, to individual Silver tables (e.g. SILVER_NYC_SECURITIES_TRANSACTIONS, SILVER_LON_SECURITIES_TRANSACTIONS); see the Silver/Gold sketch after this list.

  • Gold: unified layer that combines all facilities for enterprise-wide analytics (e.g. GOLD_SECURITIES_TRANSACTIONS).
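
To make the Pre-Bronze step concrete, here is a rough sketch of the registry build I have in mind. The storage paths, facility names, and the ops.schema_registry target are placeholders, not my actual layout, and it assumes files land under <facility>/<table_name>/*.parquet:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder layout: abfss://raw@<account>.dfs.core.windows.net/<facility>/<table_name>/*.parquet
RAW_ROOT = "abfss://raw@youraccount.dfs.core.windows.net"
FACILITIES = ["NYC", "LON"]  # placeholder facility folder names

rows = []
for facility in FACILITIES:
    # dbutils is available when this runs as a Databricks notebook or job
    for table_dir in dbutils.fs.ls(f"{RAW_ROOT}/{facility}/"):
        table_name = table_dir.name.rstrip("/")
        # Schema comes from the Parquet footers; no full data scan is needed
        schema = spark.read.parquet(table_dir.path).schema
        for field in schema.fields:
            rows.append((facility, table_name, field.name, field.dataType.simpleString()))

registry = spark.createDataFrame(
    rows, "facility string, table_name string, column_name string, data_type string"
)
# Persist the registry as a Delta table for Silver to consume (target name is a placeholder)
registry.write.mode("overwrite").saveAsTable("ops.schema_registry")
```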
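
And a minimal sketch of the parameterized Bronze ingest using Auto Loader with schema inference and evolution. The configuration keys, the table name, and the assumption that each file's parent folder is the source table name are mine to illustrate the idea, not fixed choices:

```python
import dlt
from pyspark.sql import functions as F

# Placeholder pipeline parameters, set in the DLT pipeline configuration
FACILITY = spark.conf.get("ingest.facility")        # e.g. "NYC"
SOURCE_PATH = spark.conf.get("ingest.source_path")  # e.g. "abfss://raw@<account>.dfs.core.windows.net/NYC"

@dlt.table(
    name="bronze_ingest",
    comment="Unified Bronze ingest across all source tables for one facility",
)
def bronze_ingest():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.inferColumnTypes", "true")
        # New columns discovered in later files are added to the table schema
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
        .load(SOURCE_PATH)
        # Assumes the parent folder of each file is the source table name
        .withColumn("table_name", F.element_at(F.split(F.col("_metadata.file_path"), "/"), -2))
        .withColumn("source_file", F.col("_metadata.file_path"))
        .withColumn("facility", F.lit(FACILITY))
        .withColumn("ingest_timestamp", F.current_timestamp())
    )
```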
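
For Silver and Gold, I'm picturing the DLT metaprogramming pattern: generate one Silver table per facility + table, then union them in Gold. The column lists below are hard-coded placeholders where the registry lookup would go:

```python
import dlt
from functools import reduce
from pyspark.sql import functions as F

# Placeholder: in practice this mapping would be driven by the Pre-Bronze schema registry
SILVER_DEFINITIONS = {
    ("nyc", "securities_transactions"): ["col_a", "col_b"],
    ("lon", "securities_transactions"): ["col_a", "col_c"],
}

def define_silver(facility, table_name, columns):
    # Each call registers one Silver table, e.g. silver_nyc_securities_transactions
    @dlt.table(name=f"silver_{facility}_{table_name}")
    @dlt.expect_or_drop("has_source_file", "source_file IS NOT NULL")
    def silver():
        return (
            dlt.read_stream("bronze_ingest")
            .where((F.lower(F.col("facility")) == facility) & (F.col("table_name") == table_name))
            .select("facility", "table_name", "source_file", "ingest_timestamp", *columns)
        )

for (facility, table_name), columns in SILVER_DEFINITIONS.items():
    define_silver(facility, table_name, columns)

# Gold: union the per-facility Silver tables for enterprise-wide analytics
@dlt.table(name="gold_securities_transactions")
def gold_securities_transactions():
    frames = [
        dlt.read(f"silver_{facility}_securities_transactions")
        for facility, table_name in SILVER_DEFINITIONS
        if table_name == "securities_transactions"
    ]
    return reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), frames)
```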

Do you see this as a scalable, governed approach, or would you recommend a different pattern for balancing modularity, lineage clarity, and long‑term maintainability?

 
1 REPLY

nayan_wylde
Esteemed Contributor

Your design shows strong alignment with Medallion Architecture principles and addresses schema variability well, but there are some scalability and governance considerations worth discussing. On Pre-Bronze: building a schema registry early is excellent for lineage and governance; it will help downstream processes know the maximal schema and track evolution.

Some of the potential challenges that I see are:

 

  • A single massive ingest table with tens of thousands of columns will be hard to manage and query.
  • Performance and storage overhead could become significant, especially if schema evolution adds sparse columns over time.
  • DLT can handle large pipelines, but tens of thousands of tables in one pipeline may hit operational limits (job orchestration, monitoring, debugging).
  • If Bronze is one giant table, lineage from source → Silver → Gold becomes opaque.

 

Some adjustments I would suggest are:

  • Partition Bronze by Facility or Logical Domain. Instead of one giant ingest table, create multiple Bronze tables grouped by facility or domain.
  • Expectations & Quality Rules Early. Apply basic expectations in Bronze (e.g., non-null keys) to catch bad data before Silver; see the sketch below.
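
For example, a minimal sketch of attaching expectations just downstream of the Bronze ingest; the rule names and constraints are illustrative, not your actual keys:

```python
import dlt

@dlt.view(name="bronze_ingest_checked")
# Track rows that are missing attribution, without dropping them
@dlt.expect_all({
    "has_table_name": "table_name IS NOT NULL",
    "has_facility": "facility IS NOT NULL",
})
# Drop rows that cannot be traced back to a source file
@dlt.expect_or_drop("has_source_file", "source_file IS NOT NULL")
def bronze_ingest_checked():
    return dlt.read_stream("bronze_ingest")
```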