Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to Design a Data Quality Framework for Medallion Architecture Data Pipeline

Pratikmsbsvm
Contributor

Hello,

I am building a data pipeline which extracts data from Oracle Fusion and pushes it to a Databricks Delta Lake.

I am using the Bronze, Silver, and Gold approach.

Could someone please help me understand how to control all three segments, that is Bronze, Silver, and Gold, with a data quality framework?

In short: how do you design a data quality framework for a Medallion architecture in practice?

Thanks a lot for the help.

2 REPLIES

Raman_Unifeye
Contributor III

Very broad topic. Let me try to break it down and provide a few key points.

The most practical design involves defining Data Quality Expectations (rules) in DLT for each layer and implementing an automated process to validate the data against those rules. 

Bronze: Focus on Completeness and Availability

The Bronze layer is your raw, immutable landing zone. The goal is to capture everything and avoid dropping data. Data Quality checks here are minimal and focus on the integrity of the ingestion process itself.
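
For illustration, a minimal DLT sketch of that Bronze pattern, where the expectation only warns (rows are kept and violations are recorded in the pipeline's quality metrics). The source path, column, and table names are hypothetical.

```python
import dlt

@dlt.table(comment="Raw Oracle Fusion extract, landed as-is")
@dlt.expect("order_id_present", "order_id IS NOT NULL")  # warn only: row is kept, violation is counted
def bronze_orders():
    # Hypothetical landing path; ingest everything, drop nothing
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/raw/oracle_fusion/orders/")
    )
```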

Silver: Focus on Validity, Consistency, and Uniqueness

The Silver layer is where raw data is cleaned, validated, conformed, and enriched. This is the most crucial stage for implementing business-specific quality rules.
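
A sketch of the same idea at Silver, assuming a Bronze table named bronze_orders; the rules and column names are made up for illustration. Rows that violate a rule are dropped, and uniqueness is enforced on the business key.

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Cleaned and conformed orders")
@dlt.expect_or_drop("valid_amount", "order_amount >= 0")             # drop rows that violate the rule
@dlt.expect_or_drop("valid_status", "status IN ('OPEN', 'CLOSED')")  # hypothetical business rule
def silver_orders():
    return (
        dlt.read("bronze_orders")
        .dropDuplicates(["order_id"])  # uniqueness on the business key
        .withColumn("order_amount", col("order_amount").cast("decimal(18,2)"))
    )
```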

Gold: Focus on Accuracy and Business Logic

The Gold layer is for final, aggregated, and curated business-ready data. Checks here confirm that the final transformation and aggregation logic is correct.
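
And at Gold, a sketch where a violated expectation fails the update so that bad aggregates never reach consumers; again, the table and column names are assumptions.

```python
import dlt
from pyspark.sql.functions import sum as sum_

@dlt.table(comment="Daily revenue, business-ready")
@dlt.expect_or_fail("revenue_not_negative", "daily_revenue >= 0")  # stop the update if the logic is wrong
def gold_daily_revenue():
    return (
        dlt.read("silver_orders")
        .groupBy("order_date")
        .agg(sum_("order_amount").alias("daily_revenue"))
    )
```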

Reference Link for DLT/LDP - https://docs.databricks.com/aws/en/ldp/expectations

 

nayan_wylde
Esteemed Contributor

Here's how you can implement DQ at each stage:

Bronze Layer

  • Checks:
    • File format validation (CSV, JSON, etc.).
    • Schema validation (column names, types).
    • Row count vs. source system.
  • Tools:
    • Use Databricks Auto Loader with schema evolution and badRecordsPath (a minimal ingestion sketch follows this list).
    • Implement Great Expectations or Deequ for basic validations.
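
A minimal Bronze ingestion sketch for the checks above, assuming CSV extracts landing in a cloud volume; all paths and table names are hypothetical, and depending on your schema settings malformed rows may land in the _rescued_data column instead of badRecordsPath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

bronze_stream = (
    spark.readStream.format("cloudFiles")                                # Auto Loader
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/Volumes/dq/_schemas/orders")  # schema tracking and evolution
    .option("badRecordsPath", "/Volumes/dq/_bad_records/orders")         # redirect malformed records
    .option("header", "true")
    .load("/Volumes/raw/oracle_fusion/orders/")
)

(bronze_stream.writeStream
    .option("checkpointLocation", "/Volumes/dq/_checkpoints/bronze_orders")
    .toTable("main.bronze.orders"))

# Row count vs. source can then be a simple comparison of
# spark.table("main.bronze.orders").count() against the count reported by Oracle Fusion.
```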

Silver Layer

  • Checks:
    • Remove duplicates.
    • Validate referential integrity (foreign keys).
    • Standardize data types and formats.
  • Tools:
    • Delta Live Tables (DLT) with expectations.
    • Great Expectations for advanced profiling.
  • Automation:
    • Define expectations in DLT pipelines (expectations block).
    • Fail or quarantine bad records (one quarantine pattern is sketched after this list).
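
One way to implement the quarantine pattern is to define the rules once and derive two tables from them: a clean table that drops violations and a quarantine table that keeps them for investigation. A sketch, with hypothetical rules and table names:

```python
import dlt

RULES = {
    "order_id_present": "order_id IS NOT NULL",
    "non_negative_amount": "order_amount >= 0",
}
# A row is quarantined if it breaks at least one rule
QUARANTINE_CONDITION = " OR ".join(f"NOT ({c})" for c in RULES.values())

@dlt.table(comment="Rows passing all Silver rules")
@dlt.expect_all_or_drop(RULES)
def silver_orders_clean():
    return dlt.read("bronze_orders").dropDuplicates(["order_id"])

@dlt.table(comment="Rows failing at least one rule, kept for investigation")
def silver_orders_quarantine():
    return dlt.read("bronze_orders").where(QUARANTINE_CONDITION)
```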

Gold Layer

  • Checks:
    • Business rule validation (e.g., revenue > 0).
    • KPI consistency checks.
    • Aggregation accuracy.
  • Tools:
    • DLT expectations or custom Spark jobs (see the reconciliation sketch after this list).
    • Integrate with Unity Catalog for governance and lineage.
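
For aggregation accuracy, a custom Spark job can reconcile the Gold aggregate against the Silver detail and fail on a mismatch. A sketch, assuming hypothetical Unity Catalog table names:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_

spark = SparkSession.builder.getOrCreate()

# Recompute total revenue from Silver detail and compare it with the Gold aggregate
silver_total = spark.table("main.silver.orders").agg(sum_("order_amount")).first()[0] or 0
gold_total = spark.table("main.gold.daily_revenue").agg(sum_("daily_revenue")).first()[0] or 0

if abs(float(silver_total) - float(gold_total)) > 0.01:
    raise ValueError(f"Gold revenue {gold_total} does not reconcile with Silver {silver_total}")
```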

Practical Tools

  • Great Expectations: Flexible, open-source, integrates with Databricks.
  • Delta Live Tables: Native expectations for Bronze/Silver/Gold.
  • AWS Deequ: For statistical checks.
  • Unity Catalog: Governance, lineage, and access control.