Part 3 of my series on building an enterprise data platform on Databricks is up - this one cover Gold layer design.
The short version: Gold isn't just aggregated Silver. Silver maps to your source system. Gold maps to the business questions your consumers are actually asking - and those two things are almost never the same shape.
What's in the post:
- MERGE vs overwrites for Gold writes, and the threshold where we switched (40min overwrite runs on vendor_balance at ~10M rows)
- Partitioning strategy for financial tables: BUKRS+GJAHR for period aggregates, BUKRS alone for balances, no partition on dimensions
- Z-ordering on LIFNR+MONAT for finance report query patterns
- SCD Type 2 from SAP master data using a validity window at Gold
- What doesn't belong in Gold — and the two days we spent auditing a table we eventually deleted
- Full vendor_balance Gold table in PySpark with MERGE pattern
This is Part 3 of 5. Parts 1 and 2 covered Bronze ingestion (GoldenGate + Kafka + Structured Streaming alongside JDBC historical load) and Silver reconciliation. Part 4 is about why three-layer medallion wasn't enough and what we added.
Full post: Designing the Gold Layer on Databricks — What Belongs and What Doesn’t
Happy to answer questions on any of the decisions — there were a few where we went back and forth longer than we should have.