01-05-2025 07:44 AM
Dear Community Experts,
I need your expert advice and suggestions on developing a data quality framework. Which powerful data quality tools or libraries would you recommend for building a data quality framework in Databricks?
Please guide, team.
Regards,
Shubham
Accepted Solutions
01-06-2025 10:21 AM
A year ago we did a bake-off with Soda Core, Great Expectations, Deequ and DLT Expectations. Hands down, you want to use DLT expectations. They're built into DLT and work seamlessly in your pipelines, can quarantine bad data and output statistics.
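For illustration, here is roughly what the quarantine pattern looks like (the table names and rules are hypothetical, so adapt them to your own schema): valid rows flow into a clean table via expectations, and the negated predicate captures the rejects in a quarantine table.

```python
import dlt
from pyspark.sql import functions as F

# Hypothetical rules and table names for illustration -- adjust to your schema
rules = {
    "valid_order_id": "order_id IS NOT NULL",
    "non_negative_amount": "amount >= 0",
}
quarantine_filter = "NOT ({})".format(" AND ".join(f"({c})" for c in rules.values()))

@dlt.table(comment="Rows that pass all data quality expectations")
@dlt.expect_all_or_drop(rules)  # violating rows are dropped and counted in pipeline metrics
def orders_clean():
    return dlt.read("orders_raw")

@dlt.table(comment="Rows that violate at least one expectation")
def orders_quarantine():
    return dlt.read("orders_raw").where(F.expr(quarantine_filter))
```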
Since some of our data can be updated, not all of our pipelines can use DLT, and in those cases we can't use DLT expectations. I have recently done a small POC with Cuallee (https://github.com/canimus/cuallee). It worked nicely in Databricks and might make a good alternative in these cases.
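A minimal sketch of what that POC looked like (the table and column names are placeholders; see the Cuallee docs for the full list of rule methods):

```python
from cuallee import Check, CheckLevel

df = spark.read.table("orders_raw")  # hypothetical table name

check = (
    Check(CheckLevel.WARNING, "orders_quality")
    .is_complete("order_id")   # no nulls in order_id
    .is_unique("order_id")     # no duplicate order ids
)

results = check.validate(df)   # returns a Spark DataFrame with one row per rule
results.show(truncate=False)
```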
01-05-2025 09:07 AM
Hi @shubham_007 ,
Databricks DLT gives you the ability to define data quality rules. You use expectations to define data quality constraints on the contents of a dataset. Expectations allow you to guarantee that data arriving in tables meets data quality requirements and provide insights into data quality for each pipeline update. You apply expectations to queries using Python decorators or SQL constraint clauses.
Manage data quality with Delta Live Tables | Databricks on AWS
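For example, a minimal Python sketch of the decorator syntax (the table, column and rule names are made up for illustration), showing the three violation actions:

```python
import dlt

@dlt.table(comment="Events with data quality expectations applied")
@dlt.expect("has_timestamp", "event_ts IS NOT NULL")              # warn: keep row, record metric
@dlt.expect_or_drop("valid_customer", "customer_id IS NOT NULL")  # drop violating rows
@dlt.expect_or_fail("non_negative_amount", "amount >= 0")         # fail the update on violation
def events_clean():
    return dlt.read("events_raw")
```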
You can also use open-source alternatives. The two best-known libraries are:
- Great Expectations
- Soda
Great Expectations
- Python library (the CLI was retired in April 2023)
- Allows you to define assertions about your data (named expectations)
- Provides a declarative language for describing constraints (Python + JSON)
- Provides an expectations gallery with 300+ pre-defined assertions (50+ core)
- A long list of integrations, including data catalogs, data integration tools, data sources (files, in-memory, SQL databases), orchestrators, and notebooks
- Runs data validation using Checkpoints
- Subject-matter-expert friendly: expectations can be defined using the data assistant
- Automatically generates documentation to display validation results (HTML)
- No official docker image
- Cloud version available
- Great community regarding contributions (GitHub), knowledge exchange and Q&A (Slack)
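As a rough illustration of defining expectations against a Spark DataFrame (the context/datasource wiring below follows the fluent API of recent 0.x releases and may differ in your GX version, so treat the setup lines as approximate; the expectation method names themselves are stable):

```python
import great_expectations as gx

context = gx.get_context()

# Register an in-memory Spark DataFrame as a batch (fluent datasource API)
datasource = context.sources.add_spark(name="spark_src")
asset = datasource.add_dataframe_asset(name="orders")
batch_request = asset.build_batch_request(dataframe=df)

validator = context.get_validator(batch_request=batch_request)
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_between("amount", min_value=0)

results = validator.validate()
print(results.success)
```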
Soda Core
- CLI tool and Python library
- Allows you to define assertions about your data (named checks)
- Provides a human-readable, domain-specific language for data reliability called Soda Checks Language (YAML)
- Includes 25+ built-in metrics, plus the ability to create user-defined checks (SQL queries)
- Compatible with 20+ data sources (files, in-memory, SQL databases)
- Runs data validation using scans
- Displays scan results in the CLI (with an option to save them to a file) or lets you access them programmatically
- Collects usage statistics (you can opt out)
- Docker image available
- Cloud version available
- Decent community regarding contributions (GitHub), knowledge exchange and Q&A (Slack)
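A minimal programmatic sketch using the Spark DataFrame data source (this assumes the soda-core-spark-df package; the check names, view name and thresholds are illustrative):

```python
from soda.scan import Scan

df.createOrReplaceTempView("orders")   # SodaCL checks reference this view name

scan = Scan()
scan.set_scan_definition_name("orders_scan")
scan.set_data_source_name("spark_df")
scan.add_spark_session(spark, data_source_name="spark_df")

scan.add_sodacl_yaml_str("""
checks for orders:
  - row_count > 0
  - missing_count(order_id) = 0
  - duplicate_count(order_id) = 0
""")

scan.execute()
print(scan.get_scan_results())
```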
01-12-2025 05:53 AM
Thank you @Rjdudley and @szymon_dybczak for your valuable responses.
What are some free or open-source libraries or tools for implementing a data quality framework in Databricks? Any short guidance on how to implement a data quality framework in Databricks?
01-12-2025 10:08 AM
Hi @shubham_007,
You can use the Great Expectations Python library in Databricks, which works on the Spark engine. Find more at https://docs.greatexpectations.io/docs/core/introduction/.
Regards,
Hari Prasad
01-12-2025 10:57 AM
Any short guidance on how to implement a data quality framework in Databricks?
With dbdemos, you can learn a practical architecture for data quality testing using the expectations feature of DLT. I hope this helps! (Please note that some DLT syntax might be outdated in certain sections.)
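If it helps, installing a demo takes only a couple of lines in a notebook (the demo name below is the DLT loans example; check dbdemos for the current catalogue of demos):

```python
# In a Databricks notebook cell:
# %pip install dbdemos

import dbdemos
dbdemos.install("dlt-loans")  # installs the DLT demo with expectation-based quality checks
```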
Takuya Omi (尾美拓哉)

