Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

What are powerful data quality tools/libraries to build a data quality framework in Databricks?

shubham_007
Contributor II

Dear Community Experts,

I need your expert advice and suggestions on developing a data quality framework. Which powerful data quality tools or libraries would you recommend for building a data quality framework in Databricks?

Please guide team.

Regards,

Shubham

2 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi @shubham_007 ,

Databricks DLT gives you the ability to define data quality rules. You use expectations to define data quality constraints on the contents of a dataset. Expectations let you guarantee that data arriving in tables meets data quality requirements and provide insight into data quality for each pipeline update. You apply expectations to queries using Python decorators or SQL constraint clauses.

Manage data quality with Delta Live Tables | Databricks on AWS
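
For illustration, here's a minimal Python sketch of expectations in a DLT pipeline. The @dlt.expect* decorators are the documented API, but the table and column names below are made up:

```python
import dlt

# Hypothetical source table and columns, for illustration only
@dlt.table(comment="Orders with data quality constraints applied")
@dlt.expect("positive_amount", "amount > 0")                       # record violations, keep rows
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")      # drop failing rows
@dlt.expect_or_fail("valid_order_date", "order_date IS NOT NULL")  # fail the update on violation
def orders_clean():
    return dlt.read("orders_raw")
```

The three decorator variants give you three failure policies: track violations in pipeline metrics, drop offending rows, or abort the update.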

You can also use open-source alternatives. The two best-known libraries are:

- Great Expectations

- Soda

 

Great Expectations

  • Python library (the CLI was retired in April 2023)
  • Lets you define assertions about your data (called expectations)
  • Provides a declarative language for describing constraints (Python + JSON)
  • Offers an expectations gallery with 300+ pre-defined assertions (50+ in the core library)
  • Has a long list of integrations, including data catalogs, data integration tools, data sources (files, in-memory, SQL databases), orchestrators, and notebooks
  • Runs data validation using Checkpoints
  • Friendly to subject-matter experts, who can define expectations with the data assistant
  • Automatically generates documentation to display validation results (HTML)
  • No official Docker image
  • Cloud version available
  • Great community for contributions (GitHub) and for knowledge exchange and Q&A (Slack)
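
A minimal sketch of how this looks in practice, using the fluent API from the pre-1.0 releases (entry points changed in GE 1.x, and the sample DataFrame here is made up):

```python
import great_expectations as gx
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({"order_id": [1, 2, None], "amount": [10.0, -5.0, 20.0]})

context = gx.get_context()
validator = context.sources.pandas_default.read_dataframe(df)

# Define expectations (assertions) against the data
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_between("amount", min_value=0)

# Run validation and inspect the overall outcome
results = validator.validate()
print(results.success)
```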

Soda Core

  • CLI tool and Python library
  • Lets you define assertions about your data (called checks)
  • Provides a human-readable, domain-specific language for data reliability called Soda Checks Language (SodaCL, YAML)
  • Includes 25+ built-in metrics, plus the ability to create user-defined checks (SQL queries)
  • Compatible with 20+ data sources (files, in-memory, SQL databases)
  • Runs data validation using scans
  • Displays scan results in the CLI (with the option to save them to a file) or lets you access them programmatically
  • Collects usage statistics (you can opt out)
  • Docker image available
  • Cloud version available
  • Decent community for contributions (GitHub) and for knowledge exchange and Q&A (Slack)
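
And a rough sketch of a programmatic scan against a Spark DataFrame in Databricks (assumes the soda-core-spark-df package; the view and column names are made up, and `spark` is the session Databricks notebooks provide):

```python
from soda.scan import Scan  # pip install soda-core-spark-df

# Expose a hypothetical DataFrame as a temp view so SodaCL can reference it by name
df = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["order_id", "amount"])
df.createOrReplaceTempView("orders")

scan = Scan()
scan.set_scan_definition_name("orders_quality_scan")
scan.set_data_source_name("spark_df")
scan.add_spark_session(spark, data_source_name="spark_df")

# Checks written in SodaCL (YAML)
scan.add_sodacl_yaml_str("""
checks for orders:
  - row_count > 0
  - missing_count(order_id) = 0
""")

scan.execute()
print(scan.get_logs_text())
scan.assert_no_checks_fail()  # raises if any check failed
```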

Rjdudley
Contributor II

A year ago we did a bake-off with Soda Core, Great Expectations, deequ, and DLT Expectations. Hands down, you want to use DLT Expectations. They're built into DLT and work seamlessly in your pipelines, can quarantine bad data, and output statistics.

Since some of our data can be updated, not all of our pipelines can use DLT, so we can't use DLT Expectations there. I recently did a small POC with cuallee, https://github.com/canimus/cuallee. It worked nicely in Databricks and might make a good alternative in these cases.
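
For reference, a minimal sketch of that kind of cuallee check in a Databricks notebook (the DataFrame and column names are made up; `spark` is the notebook's predefined session):

```python
from cuallee import Check, CheckLevel

# Hypothetical DataFrame, for illustration only
df = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["order_id", "amount"])

# Chain rules on a Check, then validate; validate() returns a results DataFrame
check = Check(CheckLevel.WARNING, "orders_quality")
check.is_complete("order_id").is_unique("order_id").validate(df).show(truncate=False)
```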
