Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to Design a Data Quality Framework for Medallion Architecture Data Pipeline

Pratikmsbsvm
Contributor

Hello,

I am building a data pipeline which extracts data from Oracle Fusion and pushes it to a Databricks Delta Lake.

I am using the Bronze, Silver, and Gold approach.

Could someone please help me understand how to control all three segments, that is Bronze, Silver, and Gold, with a data quality framework?

In short: how do you design a data quality framework for a Medallion architecture in practice?

Thanks a lot for the help.

2 REPLIES

Raman_Unifeye
Contributor III

Very broad topic. Let me try to break it down and provide a few key points.

The most practical design involves defining Data Quality Expectations (rules) in DLT for each layer and implementing an automated process to validate the data against those rules. 

Bronze: Focus on Completeness and Availability

The Bronze layer is your raw, immutable landing zone. The goal is to capture everything and avoid dropping data. Data Quality checks here are minimal and focus on the integrity of the ingestion process itself.
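
For illustration, a minimal DLT sketch of that Bronze pattern, where the expectation only warns (rows are kept and violations are recorded in the pipeline's quality metrics). The source path, column, and table names are hypothetical.

```python
import dlt

@dlt.table(comment="Raw Oracle Fusion extract, landed as-is")
@dlt.expect("order_id_present", "order_id IS NOT NULL")  # warn only: row is kept, violation is counted
def bronze_orders():
    # Hypothetical landing path; ingest everything, drop nothing
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/raw/oracle_fusion/orders/")
    )
```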

Silver: Focus on Validity, Consistency, and Uniqueness

The Silver layer is where raw data is cleaned, validated, conformed, and enriched. This is the most crucial stage for implementing business-specific quality rules.
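
A sketch of the same idea at Silver, assuming a Bronze table named bronze_orders; the rules and column names are made up for illustration. Rows that violate a rule are dropped, and uniqueness is enforced on the business key.

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Cleaned and conformed orders")
@dlt.expect_or_drop("valid_amount", "order_amount >= 0")             # drop rows that violate the rule
@dlt.expect_or_drop("valid_status", "status IN ('OPEN', 'CLOSED')")  # hypothetical business rule
def silver_orders():
    return (
        dlt.read("bronze_orders")
        .dropDuplicates(["order_id"])  # uniqueness on the business key
        .withColumn("order_amount", col("order_amount").cast("decimal(18,2)"))
    )
```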

Gold: Focus on Accuracy and Business Logic

The Gold layer is for final, aggregated, and curated business-ready data. Checks here confirm that the final transformation and aggregation logic is correct.
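
And at Gold, a sketch where a violated expectation fails the update so that bad aggregates never reach consumers; again, the table and column names are assumptions.

```python
import dlt
from pyspark.sql.functions import sum as sum_

@dlt.table(comment="Daily revenue, business-ready")
@dlt.expect_or_fail("revenue_not_negative", "daily_revenue >= 0")  # stop the update if the logic is wrong
def gold_daily_revenue():
    return (
        dlt.read("silver_orders")
        .groupBy("order_date")
        .agg(sum_("order_amount").alias("daily_revenue"))
    )
```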

Reference Link for DLT/LDP - https://docs.databricks.com/aws/en/ldp/expectations

 

nayan_wylde
Esteemed Contributor

Here's how you can implement DQ at each stage:

Bronze Layer

  • Checks:
    • File format validation (CSV, JSON, etc.).
    • Schema validation (column names, types).
    • Row count vs. source system.
  • Tools:
    • Use Databricks Auto Loader with schema evolution and badRecordsPath (a minimal ingestion sketch follows this list).
    • Implement Great Expectations or Deequ for basic validations.
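
A minimal Bronze ingestion sketch for the checks above, assuming CSV extracts landing in a cloud volume; all paths and table names are hypothetical, and depending on your schema settings malformed rows may land in the _rescued_data column instead of badRecordsPath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

bronze_stream = (
    spark.readStream.format("cloudFiles")                                # Auto Loader
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/Volumes/dq/_schemas/orders")  # schema tracking and evolution
    .option("badRecordsPath", "/Volumes/dq/_bad_records/orders")         # redirect malformed records
    .option("header", "true")
    .load("/Volumes/raw/oracle_fusion/orders/")
)

(bronze_stream.writeStream
    .option("checkpointLocation", "/Volumes/dq/_checkpoints/bronze_orders")
    .toTable("main.bronze.orders"))

# Row count vs. source can then be a simple comparison of
# spark.table("main.bronze.orders").count() against the count reported by Oracle Fusion.
```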

Silver Layer

  • Checks:
    • Remove duplicates.
    • Validate referential integrity (foreign keys).
    • Standardize data types and formats.
  • Tools:
    • Delta Live Tables (DLT) with expectations.
    • Great Expectations for advanced profiling.
  • Automation:
    • Define expectations in DLT pipelines (expectations block).
    • Fail or quarantine bad records (one quarantine pattern is sketched after this list).
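
One way to implement the quarantine pattern is to define the rules once and derive two tables from them: a clean table that drops violations and a quarantine table that keeps them for investigation. A sketch, with hypothetical rules and table names:

```python
import dlt

RULES = {
    "order_id_present": "order_id IS NOT NULL",
    "non_negative_amount": "order_amount >= 0",
}
# A row is quarantined if it breaks at least one rule
QUARANTINE_CONDITION = " OR ".join(f"NOT ({c})" for c in RULES.values())

@dlt.table(comment="Rows passing all Silver rules")
@dlt.expect_all_or_drop(RULES)
def silver_orders_clean():
    return dlt.read("bronze_orders").dropDuplicates(["order_id"])

@dlt.table(comment="Rows failing at least one rule, kept for investigation")
def silver_orders_quarantine():
    return dlt.read("bronze_orders").where(QUARANTINE_CONDITION)
```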

Gold Layer

  • Checks:
    • Business rule validation (e.g., revenue > 0).
    • KPI consistency checks.
    • Aggregation accuracy.
  • Tools:
    • DLT expectations or custom Spark jobs (see the reconciliation sketch after this list).
    • Integrate with Unity Catalog for governance and lineage.
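
For aggregation accuracy, a custom Spark job can reconcile the Gold aggregate against the Silver detail and fail on a mismatch. A sketch, assuming hypothetical Unity Catalog table names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_

spark = SparkSession.builder.getOrCreate()

# Recompute total revenue from Silver detail and compare it with the Gold aggregate
silver_total = spark.table("main.silver.orders").agg(sum_("order_amount")).first()[0] or 0
gold_total = spark.table("main.gold.daily_revenue").agg(sum_("daily_revenue")).first()[0] or 0

if abs(float(silver_total) - float(gold_total)) > 0.01:
    raise ValueError(f"Gold revenue {gold_total} does not reconcile with Silver {silver_total}")
```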

Practical Tools

  • Great Expectations: Flexible, open-source, integrates with Databricks.
  • Delta Live Tables: Native expectations for Bronze/Silver/Gold.
  • AWS Deequ: For statistical checks.
  • Unity Catalog: Governance, lineage, and access control.