<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Building a Data Quality pipeline with alerting in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/building-a-data-quality-pipeline-with-alerting/m-p/122090#M4121</link>
    <description>&lt;P&gt;Hi Kash,&amp;nbsp;&lt;/P&gt;&lt;P&gt;On the 4th point, do you ingest data into the model in real time, or in batch? If it's batch, DLT should be fine, but I'd love to know more; I've never seen real-time model updates before.&lt;/P&gt;</description>
    <pubDate>Wed, 18 Jun 2025 07:21:50 GMT</pubDate>
    <dc:creator>dataoculus_app</dc:creator>
    <dc:date>2025-06-18T07:21:50Z</dc:date>
    <item>
      <title>Building a Data Quality pipeline with alerting</title>
      <link>https://community.databricks.com/t5/machine-learning/building-a-data-quality-pipeline-with-alerting/m-p/15033#M812</link>
      <description>&lt;P&gt;Hi there,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My question is: how do we set up a data-quality pipeline with alerting?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Background: &lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We would like to set up a data-quality pipeline to ensure the data we collect each day is consistent and complete.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We will use key metrics found in our bronze JSON data to determine data quality. If data quality falls below a preset threshold, we would like to be notified, and the ETL process should stop in order to prevent “bad data” from loading into silver/gold and our ML models.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The solution should scale across multiple data sources and ideally be visual, so we can quickly identify the issue and fix the pipeline when problems occur (like Data Factory, but for AWS).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Goals:&lt;/B&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Visual pipeline orchestration to set up pipelines and quickly identify bottlenecks and issues&lt;/LI&gt;&lt;LI&gt;Scalable alerts/notifications keyed to metrics found inside our data, which can change&lt;/LI&gt;&lt;LI&gt;Alerts/notifications should be sent via Slack to multiple team members.&lt;/LI&gt;&lt;LI&gt;Safeguards preventing bad data from entering our ML models, i.e., stop the pipeline if a data-quality check fails&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Magic Wand Solution:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If I had a magic wand, we would have a visual pipeline orchestration tool that helps us set up and orchestrate each pipeline, visually identify bottlenecks, and alert different team members when data-quality checks fail, depending on the pipeline.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Let me know if this solution exists or if you have suggestions on how we can quickly set up something similar.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;K&lt;/P&gt;</description>
      <pubDate>Fri, 01 Jul 2022 14:44:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/building-a-data-quality-pipeline-with-alerting/m-p/15033#M812</guid>
      <dc:creator>Kash</dc:creator>
      <dc:date>2022-07-01T14:44:42Z</dc:date>
    </item>
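    <!-- Editor's note: the quality gate described in goal 4 above (halt the pipeline when a key metric falls below a preset threshold) can be sketched in plain Python. This is an illustration of the mechanism only, not a Databricks or DLT API; every name in it (THRESHOLDS, completeness, quality_gate, DataQualityError) is hypothetical.

    ```python
    # Minimal sketch of a batch quality gate: compute completeness metrics
    # for a batch of bronze records and stop the pipeline (by raising)
    # when any metric falls below its preset threshold.

    THRESHOLDS = {"user_id": 0.99, "event_ts": 0.95}  # min non-null fraction per field


    class DataQualityError(Exception):
        """Raised to halt the pipeline before bad data reaches silver/gold."""


    def completeness(records, field):
        """Fraction of records in which `field` is present and non-null."""
        if not records:
            return 0.0
        ok = sum(1 for r in records if r.get(field) is not None)
        return ok / len(records)


    def quality_gate(records):
        """Check every thresholded field; raise if any score is too low."""
        failures = {}
        for field, minimum in THRESHOLDS.items():
            score = completeness(records, field)
            if score < minimum:
                failures[field] = score
        if failures:
            # In a real pipeline this is where a Slack webhook would be
            # notified before the exception stops the ETL job.
            raise DataQualityError(f"quality below threshold: {failures}")
        return True
    ```

    On Databricks itself, the same gate would more naturally run on Spark DataFrames, with the raise replaced by failing the job so downstream tasks never start. -->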
    <item>
      <title>Re: Building a Data Quality pipeline with alerting</title>
      <link>https://community.databricks.com/t5/machine-learning/building-a-data-quality-pipeline-with-alerting/m-p/15034#M813</link>
      <description>&lt;P&gt;Hi @Avkash Kana,&amp;nbsp;I would suggest using Delta Live Tables (DLT); it has the features you are looking for: &lt;A href="https://docs.databricks.com/workflows/delta-live-tables/index.html" target="_blank"&gt;https://docs.databricks.com/workflows/delta-live-tables/index.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 09 Sep 2022 15:35:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/building-a-data-quality-pipeline-with-alerting/m-p/15034#M813</guid>
      <dc:creator>User16753725469</dc:creator>
      <dc:date>2022-09-09T15:35:03Z</dc:date>
    </item>
    <item>
      <title>Re: Building a Data Quality pipeline with alerting</title>
      <link>https://community.databricks.com/t5/machine-learning/building-a-data-quality-pipeline-with-alerting/m-p/80593#M3523</link>
      <description>&lt;P&gt;Hi Kash!&lt;/P&gt;&lt;P&gt;I know it might be too late, but if you ended up building this yourself and are struggling to scale the solution, you could take a look at &lt;A href="https://rudol.ai" target="_self"&gt;Rudol Data Quality&lt;/A&gt;. It covers pretty much everything you mentioned, with a focus on enabling non-technical roles to be part of data quality as well.&lt;/P&gt;&lt;P&gt;Have a high-quality week!&lt;/P&gt;</description>
      <pubDate>Thu, 25 Jul 2024 15:41:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/building-a-data-quality-pipeline-with-alerting/m-p/80593#M3523</guid>
      <dc:creator>joarobles</dc:creator>
      <dc:date>2024-07-25T15:41:29Z</dc:date>
    </item>
    <item>
      <title>Re: Building a Data Quality pipeline with alerting</title>
      <link>https://community.databricks.com/t5/machine-learning/building-a-data-quality-pipeline-with-alerting/m-p/122090#M4121</link>
      <description>&lt;P&gt;Hi Kash,&amp;nbsp;&lt;/P&gt;&lt;P&gt;On the 4th point, do you ingest data into the model in real time, or in batch? If it's batch, DLT should be fine, but I'd love to know more; I've never seen real-time model updates before.&lt;/P&gt;</description>
      <pubDate>Wed, 18 Jun 2025 07:21:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/building-a-data-quality-pipeline-with-alerting/m-p/122090#M4121</guid>
      <dc:creator>dataoculus_app</dc:creator>
      <dc:date>2025-06-18T07:21:50Z</dc:date>
    </item>
  </channel>
</rss>