Building a Data Quality pipeline with alerting
07-01-2022 07:44 AM
Hi there,
My question: how do we set up a data-quality pipeline with alerting?
Background:
We would like to set up a data-quality pipeline to ensure the data we collect each day is consistent and complete.
We will use key metrics found in our bronze JSON data to determine data quality. If data quality falls below a preset threshold, we would like to be notified, and the ETL process should stop to prevent “bad data” from loading into silver/gold and our ML models.
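To make the threshold idea concrete, here is a minimal sketch of such a quality gate in plain Python. It assumes the bronze data arrives as JSON lines and uses completeness (the share of records with every required field present and non-null) as the metric. The names `DataQualityError`, `completeness`, and `quality_gate`, and the example field names, are hypothetical and only illustrate the pattern.

```python
import json


class DataQualityError(Exception):
    """Raised to halt the pipeline when a quality check fails."""


def completeness(records, required_fields):
    """Return the fraction of records where every required field is present and non-null."""
    records = list(records)
    if not records:
        # Treat an empty batch as failing: no data was collected.
        return 0.0
    ok = sum(
        1 for r in records
        if all(r.get(field) is not None for field in required_fields)
    )
    return ok / len(records)


def quality_gate(json_lines, required_fields, threshold):
    """Parse JSON lines, score them, and raise if the score is below the threshold."""
    records = [json.loads(line) for line in json_lines if line.strip()]
    score = completeness(records, required_fields)
    if score < threshold:
        raise DataQualityError(
            f"completeness {score:.1%} is below threshold {threshold:.1%}"
        )
    return score


if __name__ == "__main__":
    # Hypothetical bronze batch: two of four records are complete.
    batch = [
        '{"id": 1, "ts": "2022-07-01"}',
        '{"id": 2, "ts": null}',
        '{"id": 3}',
        '{"id": 4, "ts": "2022-07-01"}',
    ]
    try:
        quality_gate(batch, ["id", "ts"], threshold=0.9)
    except DataQualityError as err:
        print(f"Stopping ETL: {err}")
```

Raising an exception is one simple way to stop downstream steps: most orchestrators mark the task as failed and skip the silver/gold loads that depend on it.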
The solution should scale across multiple data sources and ideally be visual, so we can quickly identify the issue and fix the pipeline when problems occur (like Data Factory, but for AWS).
Goals:
- Visual pipeline orchestration to set up pipelines and quickly identify bottlenecks and issues
- Scalable alerts/notifications based on key metrics found inside our data, which can change
- Alerts/notifications should be sent via Slack to multiple team members
- Safeguards preventing bad data from entering our ML models, i.e., stop the pipeline if a data-quality check fails
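The Slack goal can be sketched with Slack incoming webhooks, which accept a JSON body with a `text` field sent via HTTP POST. The following is a rough sketch only, using the Python standard library. The routing table `ROUTES`, the pipeline name `orders`, the placeholder webhook URL, and the function names are all hypothetical; a real setup would keep webhook URLs in a secret store.

```python
import json
import urllib.request

# Hypothetical routing table: pipeline name -> Slack incoming-webhook URLs.
# Each channel or team gets its own webhook; the URL below is a placeholder.
ROUTES = {
    "orders": ["https://hooks.slack.com/services/PLACEHOLDER"],
}


def build_payload(pipeline, check, value, threshold):
    """Build the JSON body for a Slack incoming webhook."""
    text = (
        f"Data-quality check failed in pipeline '{pipeline}': "
        f"{check} = {value:.1%} (threshold {threshold:.1%})"
    )
    return {"text": text}


def send_alert(webhook_url, payload, timeout=10):
    """POST the payload to one webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def alert_pipeline(pipeline, check, value, threshold, routes=ROUTES, sender=send_alert):
    """Send the same alert to every webhook registered for the pipeline."""
    payload = build_payload(pipeline, check, value, threshold)
    return [sender(url, payload) for url in routes.get(pipeline, [])]
```

Passing `sender` as a parameter keeps the routing logic testable without network access, and adding a pipeline or recipient is a change to the routing table rather than to the code.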
Magic Wand Solution:
If I had a magic wand, we would have a visual pipeline orchestration tool that helps us set up and orchestrate each pipeline, visually identify pipeline bottlenecks, and alert different team members, depending on the pipeline, when data-quality checks fail.
Let me know if this solution exists or if you have suggestions on how we can quickly set up something similar.
Thanks!
K
Labels: Autoloader, Data Quality
09-09-2022 08:35 AM
Hi @Avkash Kana, I would suggest using Delta Live Tables (DLT). It has the features you are looking for: https://docs.databricks.com/workflows/delta-live-tables/index.html
07-25-2024 08:41 AM
Hi Kash!
I know it might be too late, but if you managed to build this yourself and are struggling to scale the solution, you could take a look at Rudol Data Quality. It covers pretty much everything you mentioned, with a focus on enabling non-technical roles to take part in data quality as well.
Have a high-quality week!

