Hi there,
My question is: how do we set up a data-quality pipeline with alerting?
Background:
We would like to set up a data-quality pipeline to ensure the data we collect each day is consistent and complete.
We will use key metrics found in our bronze JSON data to determine data quality. If data quality falls below a preset threshold, we would like to be notified, and the ETL process should stop to prevent "bad data" from loading into silver/gold and our ML models.
The solution should scale across multiple data sources and should ideally be visual, so we can quickly identify the issue and fix the pipeline when problems occur (like Azure Data Factory, but for AWS).
Goals:
- Visual pipeline orchestration to set up pipelines and quickly identify bottlenecks and issues
- Scalable alerts/notifications driven by key metrics found inside our data, which can change over time
- Alerts/notifications should be sent via Slack to multiple team members.
- Safeguards preventing bad data from entering our ML models, i.e., stop the pipeline if a data-quality check fails
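To make the last goal concrete, here is a rough sketch of the kind of quality gate we have in mind. The field names and the 95% completeness threshold are just placeholders for illustration; the idea is that the check raises an exception, which halts the pipeline step before bad data reaches silver/gold:

```python
# Hypothetical example: a completeness gate for bronze JSON records.
# REQUIRED_FIELDS and the threshold are assumptions, not our real schema.
REQUIRED_FIELDS = {"event_id", "timestamp", "user_id"}
COMPLETENESS_THRESHOLD = 0.95

def completeness(records):
    """Fraction of records containing a non-null value for every required field."""
    if not records:
        return 0.0
    ok = sum(
        1 for r in records
        if all(r.get(field) is not None for field in REQUIRED_FIELDS)
    )
    return ok / len(records)

def quality_gate(records):
    """Raise to stop the ETL step when quality falls below the threshold."""
    score = completeness(records)
    if score < COMPLETENESS_THRESHOLD:
        # The orchestrator would catch this failure and fire the Slack alert.
        raise ValueError(f"Data-quality check failed: completeness={score:.2%}")
    return score
```

Something like this per data source, wired into an orchestrator that surfaces the failure visually and notifies the right team, is essentially what we are after.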
Magic Wand Solution:
If I had a magic wand, we would have a visual pipeline orchestration tool that helps us set up and orchestrate each pipeline, visually identify bottlenecks, and alert different team members when data-quality checks fail, depending on the pipeline.
Let me know if this solution exists, or if you have suggestions on how we can quickly set up something similar.
Thanks!
K