<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Building a Data Quality pipeline with alerting in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/building-a-data-quality-pipeline-with-alerting/m-p/122090#M4121</link>
    <description>&lt;P&gt;Hi Kash,&amp;nbsp;&lt;/P&gt;&lt;P&gt;On the 4th point, do you ingest data into the model in real time, or in batch? If it's batch, DLT should be fine, but I'd love to know more; I've never seen real-time model updates before.&lt;/P&gt;</description>
    <pubDate>Wed, 18 Jun 2025 07:21:50 GMT</pubDate>
    <dc:creator>dataoculus_app</dc:creator>
    <dc:date>2025-06-18T07:21:50Z</dc:date>
    <item>
      <title>Building a Data Quality pipeline with alerting</title>
      <link>https://community.databricks.com/t5/machine-learning/building-a-data-quality-pipeline-with-alerting/m-p/15033#M812</link>
      <description>&lt;P&gt;Hi there,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;My question is: how do we set up a data-quality pipeline with alerting?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Background: &lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We would like to set up a data-quality pipeline to ensure the data we collect each day is consistent and complete.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We will use key metrics found in our bronze JSON data to determine data quality. If data quality falls below a preset threshold, we would like to be notified, and the ETL process should stop in order to prevent “bad data” from loading into silver/gold and our ML models.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The solution should scale across multiple data sources and ideally be visual, so we can quickly identify the issue and fix the pipeline when problems occur (like Data Factory, but for AWS).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Goals:&lt;/B&gt;&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Visual pipeline orchestration to set up pipelines and quickly identify bottlenecks and issues&lt;/LI&gt;&lt;LI&gt;Scalable alerts/notifications keyed to metrics found inside our data, which can change&lt;/LI&gt;&lt;LI&gt;Alerts/notifications should be sent via Slack to multiple team members.&lt;/LI&gt;&lt;LI&gt;Safeguards preventing bad data from entering our ML models, i.e., stop the pipeline if a data-quality check fails&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Magic Wand Solution:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If I had a magic wand, we would have a visual pipeline orchestration tool that helps us set up and orchestrate each pipeline, visually identify bottlenecks, and alert different team members when data-quality checks fail, depending on the pipeline.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Let me know if this solution exists or if you have suggestions on how we can quickly set up something similar.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;K&lt;/P&gt;</description>
      <pubDate>Fri, 01 Jul 2022 14:44:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/building-a-data-quality-pipeline-with-alerting/m-p/15033#M812</guid>
      <dc:creator>Kash</dc:creator>
      <dc:date>2022-07-01T14:44:42Z</dc:date>
    </item>
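    <!-- Editor's note: the quality gate described in goal 4 above (halt the pipeline when a key metric falls below a preset threshold) can be sketched in plain Python. This is an illustration of the mechanism only, not a Databricks or DLT API; every name in it (THRESHOLDS, completeness, quality_gate, DataQualityError) is hypothetical.

    ```python
    # Minimal sketch of a batch quality gate: compute completeness metrics
    # for a batch of bronze records and stop the pipeline (by raising)
    # when any metric falls below its preset threshold.

    THRESHOLDS = {"user_id": 0.99, "event_ts": 0.95}  # min non-null fraction per field


    class DataQualityError(Exception):
        """Raised to halt the pipeline before bad data reaches silver/gold."""


    def completeness(records, field):
        """Fraction of records in which `field` is present and non-null."""
        if not records:
            return 0.0
        ok = sum(1 for r in records if r.get(field) is not None)
        return ok / len(records)


    def quality_gate(records):
        """Check every thresholded field; raise if any score is too low."""
        failures = {}
        for field, minimum in THRESHOLDS.items():
            score = completeness(records, field)
            if score < minimum:
                failures[field] = score
        if failures:
            # In a real pipeline this is where a Slack webhook would be
            # notified before the exception stops the ETL job.
            raise DataQualityError(f"quality below threshold: {failures}")
        return True
    ```

    On Databricks itself, the same gate would more naturally run on Spark DataFrames, with the raise replaced by failing the job so downstream tasks never start. -->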
    <item>
      <title>Re: Building a Data Quality pipeline with alerting</title>
      <link>https://community.databricks.com/t5/machine-learning/building-a-data-quality-pipeline-with-alerting/m-p/15034#M813</link>
      <description>&lt;P&gt;Hi @Avkash Kana,&amp;nbsp;I would suggest using Delta Live Tables (DLT); it has the features you are looking for: &lt;A href="https://docs.databricks.com/workflows/delta-live-tables/index.html" target="_blank"&gt;https://docs.databricks.com/workflows/delta-live-tables/index.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 09 Sep 2022 15:35:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/building-a-data-quality-pipeline-with-alerting/m-p/15034#M813</guid>
      <dc:creator>User16753725469</dc:creator>
      <dc:date>2022-09-09T15:35:03Z</dc:date>
    </item>
    <item>
      <title>Re: Building a Data Quality pipeline with alerting</title>
      <link>https://community.databricks.com/t5/machine-learning/building-a-data-quality-pipeline-with-alerting/m-p/80593#M3523</link>
      <description>&lt;P&gt;Hi Kash!&lt;/P&gt;&lt;P&gt;I know it might be too late, but if you ended up building this yourself and are struggling to scale the solution, you could take a look at &lt;A href="https://rudol.ai" target="_self"&gt;Rudol Data Quality&lt;/A&gt;. It covers pretty much everything you mentioned, with a focus on enabling non-technical roles to be part of data quality as well.&lt;/P&gt;&lt;P&gt;Have a high-quality week!&lt;/P&gt;</description>
      <pubDate>Thu, 25 Jul 2024 15:41:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/building-a-data-quality-pipeline-with-alerting/m-p/80593#M3523</guid>
      <dc:creator>joarobles</dc:creator>
      <dc:date>2024-07-25T15:41:29Z</dc:date>
    </item>
    <item>
      <title>Re: Building a Data Quality pipeline with alerting</title>
      <link>https://community.databricks.com/t5/machine-learning/building-a-data-quality-pipeline-with-alerting/m-p/122090#M4121</link>
      <description>&lt;P&gt;Hi Kash,&amp;nbsp;&lt;/P&gt;&lt;P&gt;On the 4th point, do you ingest data into the model in real time, or in batch? If it's batch, DLT should be fine, but I'd love to know more; I've never seen real-time model updates before.&lt;/P&gt;</description>
      <pubDate>Wed, 18 Jun 2025 07:21:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/building-a-data-quality-pipeline-with-alerting/m-p/122090#M4121</guid>
      <dc:creator>dataoculus_app</dc:creator>
      <dc:date>2025-06-18T07:21:50Z</dc:date>
    </item>
  </channel>
</rss>