Gold Layer Design on Databricks — MERGE vs Overwri... - Databricks Community

Community Articles

Dive into a collaborative space where members like YOU can exchange knowledge, tips, and best practices. Join the conversation today and unlock a wealth of collective wisdom to enhance your experience and drive success.

Part 3 of my series on building an enterprise data platform on Databricks is up - this one cover Gold layer design.

The short version: Gold isn't just aggregated Silver. Silver maps to your source system. Gold maps to the business questions your consumers are actually asking - and those two things are almost never the same shape.

What's in the post:

MERGE vs overwrites for Gold writes, and the threshold where we switched (40min overwrite runs on vendor_balance at ~10M rows)
Partitioning strategy for financial tables: BUKRS+GJAHR for period aggregates, BUKRS alone for balances, no partition on dimensions
Z-ordering on LIFNR+MONAT for finance report query patterns
SCD Type 2 from SAP master data using a validity window at Gold
What doesn't belong in Gold — and the two days we spent auditing a table we eventually deleted
Full vendor_balance Gold table in PySpark with MERGE pattern

This is Part 3 of 5. Parts 1 and 2 covered Bronze ingestion (GoldenGate + Kafka + Structured Streaming alongside JDBC historical load) and Silver reconciliation. Part 4 is about why three-layer medallion wasn't enough and what we added.

Full post: Designing the Gold Layer on Databricks — What Belongs and What Doesn’t

Happy to answer questions on any of the decisions — there were a few where we went back and forth longer than we should have.