Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

What are powerful data quality tools/libraries to build a data quality framework in Databricks?

shubham_007
Contributor II

Dear Community Experts,

I need your expert advice and suggestions on the development of a data quality framework. Which data quality tools or libraries are a good choice for building a data quality framework in Databricks?

Please guide, team.

Regards,

Shubham


5 Replies

szymon_dybczak
Esteemed Contributor III

Hi @shubham_007 ,

Databricks DLT gives you the ability to define data quality rules. You use expectations to define data quality constraints on the contents of a dataset. Expectations let you guarantee that data arriving in tables meets data quality requirements, and they provide insight into data quality for each pipeline update. You apply expectations to queries using Python decorators or SQL constraint clauses; a short sketch follows the link below.

Manage data quality with Delta Live Tables | Databricks on AWS
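For illustration, here's a minimal sketch of expectations in a Python DLT pipeline (the table and column names are made up; see the docs linked above for the full API):

import dlt

@dlt.table(comment="Orders with data quality rules applied")
@dlt.expect("valid_order_date", "order_date IS NOT NULL")        # warn: keep the row, record the violation
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")    # drop rows that fail the rule
@dlt.expect_or_fail("positive_amount", "amount > 0")             # fail the update on any violation
def orders_clean():
    return dlt.read("orders_raw")

In SQL, the equivalent is a constraint clause on the table definition, e.g. CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW.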

You can also use open source alternatives. The two best-known libraries are:

- Great Expectations

- Soda

 

Great Expectations

  • Python library (the CLI was retired in April 2023)
  • Lets you define assertions about your data (called expectations)
  • Provides a declarative language for describing constraints (Python + JSON)
  • Offers an expectations gallery with 300+ pre-defined assertions (50+ core)
  • A long list of integrations, including data catalogs, data integration tools, data sources (files, in-memory, SQL databases), orchestrators, and notebooks
  • Runs data validation using Checkpoints
  • Friendly to subject-matter experts: expectations can be defined with the data assistant
  • Automatically generates documentation (HTML) to display validation results
  • No official Docker image
  • Cloud version available
  • Strong community for contributions (GitHub) and for knowledge exchange and Q&A (Slack)
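To make that concrete, here is a minimal sketch of validating an existing Spark DataFrame df with Great Expectations. One caveat: GX's fluent API has changed across releases, so treat this as written against the 0.17/0.18-era API, with made-up datasource and asset names:

import great_expectations as gx

context = gx.get_context()
datasource = context.sources.add_spark(name="spark_ds")    # hypothetical datasource name
asset = datasource.add_dataframe_asset(name="orders")      # hypothetical asset name
batch_request = asset.build_batch_request(dataframe=df)    # df is an existing Spark DataFrame

validator = context.get_validator(batch_request=batch_request)
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_between("amount", min_value=0)

results = validator.validate()
print(results.success)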

Soda Core

  • CLI tool and Python library
  • Lets you define assertions about your data (called checks)
  • Provides a human-readable, domain-specific language for data reliability, Soda Checks Language (YAML)
  • Includes 25+ built-in metrics, plus the ability to create user-defined checks (SQL queries)
  • Compatible with 20+ data sources (files, in-memory, SQL databases)
  • Runs data validation using scans
  • Displays scan results in the CLI (with an option to save them to a file) or exposes them programmatically
  • Collects usage statistics (you can opt out)
  • Docker image available
  • Cloud version available
  • Decent community for contributions (GitHub) and for knowledge exchange and Q&A (Slack)
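And a comparable Soda Core sketch, assuming the soda-core-spark-df package (the scan definition, data source, and table names are made up; the checks are written in SodaCL YAML):

from soda.scan import Scan

# Register the DataFrame as a view so SodaCL checks can reference it by name
df.createOrReplaceTempView("orders")

scan = Scan()
scan.set_scan_definition_name("orders_quality")            # hypothetical name
scan.set_data_source_name("spark_df")
scan.add_spark_session(spark, data_source_name="spark_df")
scan.add_sodacl_yaml_str("""
checks for orders:
  - row_count > 0
  - missing_count(order_id) = 0
""")

scan.execute()
print(scan.get_scan_results())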

Rjdudley
Valued Contributor III (Accepted Solution)

A year ago we did a bake-off with Soda Core, Great Expectations, deequ, and DLT Expectations. Hands down, you want to use DLT Expectations: it's built into DLT, works seamlessly in your pipelines, can quarantine bad data, and outputs statistics.

Since some of our data can be updated, not all of our pipelines can use DLT, so we can't use DLT Expectations in those cases. I recently did a small POC with Cuallee, https://github.com/canimus/cuallee. It worked nicely in Databricks and might make a good alternative in these cases; see the sketch below.
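A minimal cuallee sketch, based on the project's README (the DataFrame df and the column names order_id and amount are made up for illustration):

from cuallee import Check, CheckLevel

# A named group of checks; WARNING level reports failures without raising
check = Check(CheckLevel.WARNING, "orders_quality")
check.is_complete("order_id")        # no nulls in order_id
check.is_unique("order_id")          # no duplicate order_ids
check.is_greater_than("amount", 0)   # every amount is positive

# validate() returns a Spark DataFrame with one result row per rule
check.validate(df).show(truncate=False)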

shubham_007
Contributor II

Thank you @Rjdudley and @szymon_dybczak for your valuable responses.

What free or open-source libraries or tools are available for implementing a data quality framework in Databricks? Any short guidance on how to implement a data quality framework in Databricks?

hari-prasad
Valued Contributor II

Hi @shubham_007

You can use the Great Expectations Python library in Databricks; it runs on the Spark engine. Find more at https://docs.greatexpectations.io/docs/core/introduction/.

 

Regards,
Hari Prasad




Takuya-Omi
Valued Contributor II

"Any short guidance on how to implement a data quality framework in Databricks?"

With dbdemos, you can learn a practical architecture for data quality testing using the expectations feature of DLT. I hope this helps! (Please note that some DLT syntax might be outdated in certain sections.)

https://www.databricks.com/resources/demos/tutorials/data-science-and-ai/unit-testing-delta-live-tab...

 

--------------------------
Takuya Omi (尾美拓哉)
