Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Declarative Pipelines with data contracts

ismaelhenzel
Contributor

I'm wondering if anyone has successfully integrated data contracts with declarative pipelines in Databricks. Specifically, I want to reuse the quality checks and schema definitions from the contract directly within the pipeline's stages. I haven't found much information or many examples of this pattern in the Databricks ecosystem.


4 REPLIES

ManojkMohan
Contributor III

@ismaelhenzel 

I was working on a similar use case:

A source system generates raw Call Detail Records (CDRs) for every call, text, and data session on the network. Your goal is to build a reliable pipeline that cleans and prepares this data for a downstream fraud analytics team.

Solution:

  • Define the Data Contract
    Create a YAML file (cdr_contract.yaml) that serves as the source of truth for the data schema (column names, types, constraints) and the data quality rules (null checks, valid ranges, regex patterns, etc.). Store this YAML file in your Git repository alongside your pipeline code for version control and traceability. A minimal example of such a contract is sketched after this list.
  • Create the DLT Python Notebook
    In your Databricks workspace, create a Python notebook (process_cdrs.py). This notebook will read and parse cdr_contract.yaml and implement ingestion, transformations, and data quality checks.
  • Implement the Bronze and Silver Layers
    Bronze layer: ingest raw CDRs exactly as received, preserving the original data for replayability and auditing. Silver layer: apply transformations and validations based on cdr_contract.yaml; standardize formats, remove duplicates, and enforce the quality rules.
  • Configure and Run the DLT Pipeline
    Create a new pipeline, point it to your DLT notebook (process_cdrs.py), configure the target schema/database, and start the pipeline to begin processing data.
  • Monitor Data Quality
    In the DLT UI, select your pipeline to view the graph and review the metrics for each quality rule from cdr_contract.yaml (records passed, records failed). Update the YAML contract as rules evolve and re-run the pipeline to see the metrics update automatically.
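
To make the first step concrete, here is a minimal sketch of what cdr_contract.yaml could contain, parsed with pyyaml exactly as the notebook would parse it. The key names (table, schema, checks) and the columns are assumptions for illustration, not a fixed standard:

```python
# Illustrative only: the contract layout and column names are assumptions;
# adapt them to your own contract template.
import yaml

CONTRACT_YAML = """
table: cdrs
schema:
  - {name: call_id,    type: string,    nullable: false}
  - {name: caller_id,  type: string,    nullable: false}
  - {name: started_at, type: timestamp, nullable: false}
  - {name: duration_s, type: double,    nullable: true}
checks:
  - {name: valid_call_id,  expression: "call_id IS NOT NULL"}
  - {name: valid_duration, expression: "duration_s >= 0"}
"""

# Parsing yields plain dicts/lists that the pipeline code can consume directly.
contract = yaml.safe_load(CONTRACT_YAML)
print(contract["table"], len(contract["schema"]), len(contract["checks"]))
```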

 

That's an interesting approach. Are you placing the schemas and quality rules in your DLT code based on reading a data contract? Currently, I'm using Databricks DQX YAML files to create my quality rules and defining the schemas directly in the @dlt.table definitions. I would like to use a standardized data contract template to dynamically apply both schemas and quality rules in my DLT code.

Yes, exactly. The approach I used for the Call Detail Records (CDRs) pipeline was to externalize both the schema definitions and data quality rules into a standardized YAML data contract and dynamically apply them within the DLT pipeline.

In your case, since you're already using DQX YAML for quality rules and defining schemas directly in @dlt.table, you're already partway there.

To extend this into a fully contract-driven architecture, here's how I approached it:

 

Standardize the Data Contract (cdr_contract.yaml)
Unify schema and quality rules into a single YAML file. This serves as the single source of truth for raw CDR ingestion.

Dynamic Application in DLT (process_cdrs.py)
In the DLT notebook:

Use pyyaml to parse cdr_contract.yaml.
Construct the schema dynamically using StructType and StructField.
Apply expectations using @dlt.expect_all(...) based on the checks section.
This makes the Bronze → Silver transformation layer fully driven by the YAML, ensuring schema alignment and reusable quality rules across stages.
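
As a minimal end-to-end sketch, assuming the contract uses the schema/checks layout shown earlier and sits next to the notebook, and with the ingestion path and table names (bronze_cdrs, silver_cdrs) as placeholders:

```python
import dlt
import yaml
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

# Map the contract's type names to Spark types (extend as needed).
TYPE_MAP = {"string": StringType(), "double": DoubleType(),
            "timestamp": TimestampType()}

# Assumption: the contract file lives alongside the pipeline code in the repo.
with open("cdr_contract.yaml") as f:
    contract = yaml.safe_load(f)

# Build the target schema from the contract's schema section.
schema = StructType([
    StructField(c["name"], TYPE_MAP[c["type"]], c.get("nullable", True))
    for c in contract["schema"]
])

# Build the expectations dict from the contract's checks section.
expectations = {c["name"]: c["expression"] for c in contract["checks"]}

@dlt.table(name="bronze_cdrs", comment="Raw CDRs as received from the source")
def bronze_cdrs():
    # Hypothetical landing zone; replace with your actual ingestion source.
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/Volumes/main/raw/cdrs"))

@dlt.table(name="silver_cdrs", schema=schema, comment="Contract-validated CDRs")
@dlt.expect_all(expectations)
def silver_cdrs():
    # Cast every column to the type declared in the contract.
    return dlt.read_stream("bronze_cdrs").select(
        *[F.col(c["name"]).cast(c["type"]) for c in contract["schema"]]
    )
```

The point is that changing the contract changes both the declared schema and the expectations without touching the pipeline code itself.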

WiliamRosa
New Contributor III

Suggested Steps:

Define the data contract

Create a YAML/JSON file containing:

Schema (column names, data types, required fields)

Data quality rules (null checks, ranges, regex patterns, allowed value lists)

Governance metadata (e.g., data sensitivity, LGPD classification)

Store the contract in a Git repository for versioning and auditing.

Create a library to read and interpret the contract

Implement a Python function that reads the file and returns a structured object (dict or DataFrame) for use in transformations.

Ensure support for multiple tables in the same contract to allow for more generic pipelines.
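
A hypothetical helper for this library step, assuming the contract holds a top-level "tables" list with per-table schema, checks, and governance sections:

```python
import yaml

def load_contract(path: str) -> dict:
    """Parse a YAML data contract and index its table definitions by name."""
    with open(path) as f:
        doc = yaml.safe_load(f)
    return {t["name"]: t for t in doc.get("tables", [])}

# Usage sketch (path and table name are placeholders):
# contract = load_contract("contracts/cdr_contract.yaml")
# cdr_def = contract["cdrs"]
# cdr_def["schema"], cdr_def["checks"], cdr_def.get("governance", {})
```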

Implement the DLT notebook

Load the contract at the start of execution.

Ingest raw data (Bronze layer).

Apply transformations and validations based on the contract (Silver and/or Gold layers).

Generate quality rules automatically

Map each contract rule to @dlt.expect or @dlt.expect_or_drop instructions.

Create dynamic functions to avoid repetitive code.
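
One possible shape for such a dynamic function, assuming each contract rule carries an "action" field ("warn" or "drop") that decides whether it becomes a dlt.expect_all or a dlt.expect_all_or_drop expectation:

```python
import dlt

def apply_contract_expectations(rules):
    """Build a decorator that routes contract rules to DLT expectations by action."""
    warn = {r["name"]: r["expression"]
            for r in rules if r.get("action", "warn") == "warn"}
    drop = {r["name"]: r["expression"]
            for r in rules if r.get("action") == "drop"}

    def decorator(fn):
        if warn:
            fn = dlt.expect_all(warn)(fn)          # record violations, keep rows
        if drop:
            fn = dlt.expect_all_or_drop(drop)(fn)  # drop violating rows
        return fn

    return decorator

# Usage sketch:
# @dlt.table(name="silver_cdrs")
# @apply_contract_expectations(contract["cdrs"]["checks"])
# def silver_cdrs(): ...
```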

Configure the declarative pipeline in Databricks

Point to the implemented notebook.

Define the target schema/database in Unity Catalog.

Configure alerts or notifications for quality failures.

Monitor and adjust continuously

Track metrics in the DLT UI (passed/failed records).

Update the contract and reprocess data without changing the transformation logic.

Automate contract updates (new step)

Integrate with APIs or catalog systems (e.g., Collibra, Purview) to automatically update contracts when source schema changes occur.

Ensure that changes are reviewed before being applied to the pipeline.

Test contracts before production (new step)

Create unit and integration tests to validate that contract rules work as expected using sample data.

Use pytest or QA notebooks in Databricks to validate changes.
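
A hypothetical pytest sketch for this step: run every contract check as a SQL expression against a small sample dataset and assert that no rows violate it (file paths and column names are placeholders):

```python
import yaml
import pytest
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

@pytest.fixture(scope="session")
def spark():
    # Local Spark session for unit tests.
    return (SparkSession.builder
            .master("local[1]").appName("contract-tests").getOrCreate())

def test_sample_cdrs_satisfy_contract(spark):
    with open("contracts/cdr_contract.yaml") as f:
        contract = yaml.safe_load(f)
    sample = spark.read.json("tests/data/sample_cdrs.json")  # sample fixture data
    for check in contract["checks"]:
        violations = sample.filter(~expr(check["expression"])).count()
        assert violations == 0, f"{check['name']} failed for {violations} row(s)"
```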

Document and share best practices (new step)

Create documentation in the repository explaining how contracts work and how to update them.

Include examples of contracts and notebooks to speed up adoption by other teams.

Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa
