Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Building a Data Quality pipeline with alerting

Kash
Contributor III

Hi there,

My question is: how do we set up a data-quality pipeline with alerting?

Background:

We would like to set up a data-quality pipeline to ensure the data we collect each day is consistent and complete.

We will use key metrics found in our bronze JSON data to determine data quality. If data quality falls below a preset threshold, we would like to be notified, and the ETL process should stop in order to prevent "bad data" from loading into silver/gold and our ML models.

The solution should scale across multiple data sources and ideally be visual, so we can quickly identify the issue and fix the pipeline when problems occur (like DataFactory but for AWS).

Goals:

  1. Visual pipeline orchestration to set up pipelines and quickly identify bottlenecks and issues.
  2. Scalable alerts/notifications based on key metrics found inside our data, with thresholds that can change.
  3. Alerts/notifications sent via Slack to multiple team members.
  4. Safeguards preventing bad data from entering our ML models, i.e., stop the pipeline if a data-quality check fails.
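
To make goals 2 to 4 concrete, here is a minimal sketch of a quality gate in plain Python. It is only an illustration under assumptions, not a Databricks feature: the names `Check`, `quality_gate`, and `DataQualityError`, the metric names, and the thresholds are all hypothetical, and the metrics dictionary is assumed to be computed from the bronze data in an earlier step. The Slack call assumes an incoming-webhook URL, which accepts a JSON body with a `text` field.

```python
import json
import urllib.request
from dataclasses import dataclass


class DataQualityError(Exception):
    """Raised to stop the pipeline when at least one check fails."""


@dataclass
class Check:
    metric: str       # name of a metric computed from the bronze data (hypothetical)
    min_value: float  # lowest acceptable value for that metric (hypothetical)


def failed_checks(metrics, checks):
    """Return the checks whose metric is missing or below its threshold."""
    return [c for c in checks
            if metrics.get(c.metric, float("-inf")) < c.min_value]


def slack_payload(pipeline, failures, metrics):
    """Build the JSON body for a Slack incoming webhook."""
    lines = [f"Data-quality check failed for pipeline '{pipeline}':"]
    for c in failures:
        lines.append(
            f"- {c.metric}: got {metrics.get(c.metric)}, expected >= {c.min_value}"
        )
    return {"text": "\n".join(lines)}


def send_to_slack(webhook_url, payload):
    """POST the payload to a Slack incoming-webhook URL (assumed to exist)."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()


def quality_gate(pipeline, metrics, checks, notify=None):
    """Notify on failure, then raise so downstream steps do not run."""
    failures = failed_checks(metrics, checks)
    if failures:
        if notify is not None:
            notify(slack_payload(pipeline, failures, metrics))
        names = ", ".join(c.metric for c in failures)
        raise DataQualityError(f"{pipeline}: failed checks: {names}")
```

In a job, this could be called between the bronze and silver steps, for example `quality_gate("orders", metrics, checks, notify=lambda p: send_to_slack(url, p))`, where `url` is the team's webhook. Because the gate raises an exception, the task that runs it fails, and an orchestrator that only runs dependent tasks on success would not load silver/gold. Keeping one list of checks and one webhook per pipeline is one way to alert different team members depending on the pipeline.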

Magic Wand Solution:

If I had a magic wand, we would have a visual pipeline orchestration tool that helps us set up and orchestrate each pipeline, visually identify pipeline bottlenecks, and alert different team members, depending on the pipeline, when data-quality checks fail.

Let me know if this solution exists or if you have suggestions on how we can quickly set up something similar.

Thanks!

K

2 REPLIES

User16753725469
Contributor II

Hi @Avkash Kana, I would suggest using Delta Live Tables (DLT); it has the features you are looking for: https://docs.databricks.com/workflows/delta-live-tables/index.html

joarobles
New Contributor III

Hi Kash!

I know it might be too late, but if you managed to build this yourself and are struggling to scale the solution, you could take a look at Rudol Data Quality. It covers pretty much everything you mentioned, with a focus on enabling non-technical roles to take part in data quality as well.

Have a high-quality week!
