Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

What are powerful data quality tools/libraries to build a data quality framework in Databricks?

shubham_007
Contributor II

Dear Community Experts,

I need your expert advice and suggestions on developing a data quality framework. Which powerful data quality tools or libraries would you recommend for building a data quality framework in Databricks?

Please guide team.

Regards,

Shubham

2 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi @shubham_007 ,

Databricks DLT gives you the ability to define data quality rules. You use expectations to define data quality constraints on the contents of a dataset. Expectations let you guarantee that data arriving in tables meets data quality requirements and provide insight into data quality for each pipeline update. You apply expectations to queries using Python decorators or SQL constraint clauses.

Manage data quality with Delta Live Tables | Databricks on AWS
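
For illustration, here's a minimal Python sketch of expectations in a DLT pipeline. The @dlt.expect* decorators are the documented API, but the table and column names below are made up:

```python
import dlt

# Hypothetical source table and columns, for illustration only
@dlt.table(comment="Orders with data quality constraints applied")
@dlt.expect("positive_amount", "amount > 0")                       # record violations, keep rows
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")      # drop failing rows
@dlt.expect_or_fail("valid_order_date", "order_date IS NOT NULL")  # fail the update on violation
def orders_clean():
    return dlt.read("orders_raw")
```

The three decorator variants give you three failure policies: track violations in pipeline metrics, drop offending rows, or abort the update.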

You can also use open-source alternatives. The two best-known libraries are:

- Great Expectations

- Soda

 

Great Expectations

  • Python library (the CLI was retired in April 2023)
  • Lets you define assertions about your data (called expectations)
  • Provides a declarative language for describing constraints (Python + JSON)
  • Offers an expectations gallery with 300+ pre-defined assertions (50+ in the core library)
  • Has a long list of integrations, including data catalogs, data integration tools, data sources (files, in-memory, SQL databases), orchestrators, and notebooks
  • Runs data validation using Checkpoints
  • Friendly to subject-matter experts, who can define expectations with the data assistant
  • Automatically generates documentation to display validation results (HTML)
  • No official Docker image
  • Cloud version available
  • Great community for contributions (GitHub) and for knowledge exchange and Q&A (Slack)
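
A minimal sketch of how this looks in practice, using the fluent API from the pre-1.0 releases (entry points changed in GE 1.x, and the sample DataFrame here is made up):

```python
import great_expectations as gx
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"order_id": [1, 2, None], "amount": [10.0, -5.0, 20.0]})

context = gx.get_context()
validator = context.sources.pandas_default.read_dataframe(df)

# Define expectations (assertions) against the data
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_between("amount", min_value=0)

# Run validation and inspect the overall outcome
results = validator.validate()
print(results.success)
```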

Soda Core

  • CLI tool and Python library
  • Lets you define assertions about your data (called checks)
  • Provides a human-readable, domain-specific language for data reliability called Soda Checks Language (SodaCL, YAML)
  • Includes 25+ built-in metrics, plus the ability to create user-defined checks (SQL queries)
  • Compatible with 20+ data sources (files, in-memory, SQL databases)
  • Runs data validation using scans
  • Displays scan results in the CLI (with the option to save them to a file) or lets you access them programmatically
  • Collects usage statistics (you can opt out)
  • Docker image available
  • Cloud version available
  • Decent community for contributions (GitHub) and for knowledge exchange and Q&A (Slack)
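
And a rough sketch of a programmatic scan against a Spark DataFrame in Databricks (assumes the soda-core-spark-df package; the view and column names are made up, and `spark` is the session Databricks notebooks provide):

```python
from soda.scan import Scan  # pip install soda-core-spark-df

# Expose a hypothetical DataFrame as a temp view so SodaCL can reference it by name
df = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["order_id", "amount"])
df.createOrReplaceTempView("orders")

scan = Scan()
scan.set_scan_definition_name("orders_quality_scan")
scan.set_data_source_name("spark_df")
scan.add_spark_session(spark, data_source_name="spark_df")

# Checks written in SodaCL (YAML)
scan.add_sodacl_yaml_str("""
checks for orders:
  - row_count > 0
  - missing_count(order_id) = 0
""")

scan.execute()
print(scan.get_logs_text())
scan.assert_no_checks_fail()  # raises if any check failed
```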

Rjdudley
Contributor II

A year ago we did a bake-off with Soda Core, Great Expectations, deequ, and DLT Expectations. Hands down, you want to use DLT Expectations. They're built into DLT and work seamlessly in your pipelines, can quarantine bad data, and output statistics.

Since some of our data can be updated, not all of our pipelines can use DLT, so we can't use DLT Expectations there. I recently did a small POC with cuallee, https://github.com/canimus/cuallee. It worked nicely in Databricks and might make a good alternative in these cases.
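
For reference, a minimal sketch of that kind of cuallee check in a Databricks notebook (the DataFrame and column names are made up; `spark` is the notebook's predefined session):

```python
from cuallee import Check, CheckLevel

# Hypothetical DataFrame, for illustration only
df = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["order_id", "amount"])

# Chain rules on a Check, then validate; validate() returns a results DataFrame
check = Check(CheckLevel.WARNING, "orders_quality")
check.is_complete("order_id").is_unique("order_id").validate(df).show(truncate=False)
```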
