01-05-2025 07:44 AM
Dear Community Experts,
I need your advice and suggestions on developing a data quality framework. Which powerful data quality tools or libraries are a good fit for building a data quality framework in Databricks?
Please guide our team.
Regards,
Shubham
01-06-2025 10:21 AM
A year ago we did a bake-off with Soda Core, Great Expectations, deequ and DLT Expectations. Hands down, you want to use DLT Expectations. They are built into DLT, work seamlessly in your pipelines, can quarantine bad data, and output statistics.
Because some of our data can be updated, not all of our pipelines can use DLT, and those pipelines can't use DLT Expectations. I recently did a small POC with Cuallee, https://github.com/canimus/cuallee. It worked nicely in Databricks and might make a good alternative in these cases.
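For pipelines that can't use DLT, the same "quarantine bad rows and output statistics" idea can be approximated with plain DataFrame filters. This is only a rough sketch, not Cuallee's API or anything from the bake-off: the helper name `split_by_rules` and the rule and column names in the example are made up for illustration.

```python
def split_by_rules(df, rules):
    """Split a Spark DataFrame into (passed, failed, stats).

    rules: dict mapping a rule name to a SQL boolean expression (hypothetical names).
    A rule whose expression evaluates to NULL for a row is treated as failed.
    """
    # coalesce(..., false) turns NULL results into false so every row lands
    # in exactly one of the two outputs.
    checks = {name: f"coalesce({expr}, false)" for name, expr in rules.items()}
    combined = " AND ".join(checks.values())

    passed = df.where(combined)             # rows that satisfy every rule
    failed = df.where(f"NOT ({combined})")  # rows that break at least one rule

    # Number of failing rows per rule (one Spark job per rule).
    stats = {name: df.where(f"NOT {check}").count() for name, check in checks.items()}
    return passed, failed, stats
```

Called as `split_by_rules(df, {"valid_id": "order_id IS NOT NULL"})`, the `failed` DataFrame could then be appended to a quarantine table and `stats` logged for each run.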
01-05-2025 09:07 AM
Hi @shubham_007 ,
Databricks DLT gives you the ability to define data quality rules. You use expectations to define data quality constraints on the contents of a dataset. Expectations let you check that data arriving in tables meets your data quality requirements, and they provide insight into data quality for each pipeline update. You apply expectations to queries using Python decorators or SQL constraint clauses.
Manage data quality with Delta Live Tables | Databricks on AWS
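As a rough illustration of the Python decorator style, here is a minimal sketch. The table names (`orders_raw`, `orders_clean`, `orders_quarantine`), the column names, and the rules are hypothetical, and the `dlt` module is only importable inside a DLT pipeline, so the table definitions are guarded.

```python
# Hypothetical rule set: rule name -> SQL boolean expression.
RULES = {
    "valid_id": "order_id IS NOT NULL",
    "positive_amount": "amount > 0",
}


def quarantine_condition(rules):
    """Build a SQL expression that is true when at least one rule is false.

    Rules that evaluate to NULL need explicit handling; this sketch ignores that.
    """
    return "NOT ({})".format(" AND ".join(f"({expr})" for expr in rules.values()))


try:
    import dlt  # available only when the code runs as part of a DLT pipeline
except ImportError:
    dlt = None

if dlt is not None and hasattr(dlt, "expect_all_or_drop"):

    @dlt.table(comment="Orders that satisfy every rule")
    @dlt.expect_all_or_drop(RULES)  # drop violating rows, record per-rule metrics
    def orders_clean():
        return dlt.read("orders_raw")

    @dlt.table(comment="Orders that break at least one rule")
    def orders_quarantine():
        return dlt.read("orders_raw").where(quarantine_condition(RULES))
```

`@dlt.expect_all` (keep the rows, only record violations) and `@dlt.expect_all_or_fail` (stop the update) take the same kind of dictionary, so the action can be changed without touching the rules.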
You can also use open-source alternatives. The two best-known libraries are:
- Great Expectations
- Soda Core
01-12-2025 05:53 AM
Thank you @Rjdudley and @szymon_dybczak for your valuable responses.
Which free or open-source libraries or tools can be used to implement a data quality framework in Databricks? Any short guidance on how to implement a data quality framework in Databricks?
01-12-2025 10:08 AM
Hi @shubham_007,
You can use the Great Expectations Python library in Databricks; it works with the Spark engine. You can find more at https://docs.greatexpectations.io/docs/core/introduction/ .
Regards,
Hari Prasad
01-12-2025 10:57 AM
Any short guidance on how to implement a data quality framework in Databricks?
With dbdemos, you can learn a practical architecture for data quality testing using the expectations feature of DLT. I hope this helps! (Please note that some DLT syntax might be outdated in certain sections.)