<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: What are powerful data quality tools/libraries to build a data quality framework in Databricks? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/what-are-powerfull-data-quality-tools-libraries-to-build-data/m-p/104418#M41738</link>
    <description>&lt;P&gt;A year ago we did a bake-off with Soda Core, Great Expectations, deequ and DLT Expectations. Hands down, you want to use DLT Expectations. It's built into DLT, works seamlessly in your pipelines, and can quarantine bad data and output statistics.&lt;/P&gt;&lt;P&gt;Because some of our data can be updated, not all of our pipelines can use DLT, so those pipelines can't use DLT Expectations. I recently did a small POC with Cuallee, &lt;A href="https://github.com/canimus/cuallee" target="_blank"&gt;https://github.com/canimus/cuallee&lt;/A&gt;. It worked nicely in Databricks and might make a good alternative in these cases.&lt;/P&gt;</description>
    <pubDate>Mon, 06 Jan 2025 18:21:34 GMT</pubDate>
    <dc:creator>Rjdudley</dc:creator>
    <dc:date>2025-01-06T18:21:34Z</dc:date>
    <item>
      <title>What are powerful data quality tools/libraries to build a data quality framework in Databricks?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-are-powerfull-data-quality-tools-libraries-to-build-data/m-p/104248#M41692</link>
      <description>&lt;P&gt;Dear Community Experts,&lt;/P&gt;&lt;P&gt;I need your expert advice and suggestions on developing a data quality framework. Which powerful data quality tools or libraries are a good fit for building a data quality framework in Databricks?&lt;/P&gt;&lt;P&gt;Please advise, team.&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Shubham&lt;/P&gt;</description>
      <pubDate>Sun, 05 Jan 2025 15:44:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-are-powerfull-data-quality-tools-libraries-to-build-data/m-p/104248#M41692</guid>
      <dc:creator>shubham_007</dc:creator>
      <dc:date>2025-01-05T15:44:03Z</dc:date>
    </item>
    <item>
      <title>Re: What are powerful data quality tools/libraries to build a data quality framework in Databricks?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-are-powerfull-data-quality-tools-libraries-to-build-data/m-p/104250#M41694</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/100776"&gt;@shubham_007&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Databricks DLT gives you the ability to define data quality rules.&amp;nbsp;&lt;SPAN&gt;You use expectations to define data quality constraints on the contents of a dataset. Expectations allow you to guarantee data arriving in tables meets data quality requirements and provide insights into data quality for each pipeline update. You apply expectations to queries using Python decorators or SQL constraint clauses.&lt;/SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://docs.databricks.com/en/delta-live-tables/expectations.html" target="_blank" rel="noopener"&gt;Manage data quality with Delta Live Tables | Databricks on AWS&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;You can also use open-source alternatives. The two best-known libraries are:&lt;/P&gt;&lt;P&gt;- Great Expectations&lt;/P&gt;&lt;P&gt;- Soda Core&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Great Expectations&lt;/STRONG&gt;&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;Python library (retired&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" href="https://docs.greatexpectations.io/docs/terms/cli/" target="_blank" rel="noopener ugc nofollow"&gt;CLI&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;since&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" href="https://github.com/great-expectations/great_expectations/pull/7700" target="_blank" rel="noopener ugc nofollow"&gt;April 2023&lt;/A&gt;)&lt;/LI&gt;&lt;LI&gt;Allows you to define assertions about your data (named&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" href="https://docs.greatexpectations.io/docs/terms/expectation" target="_blank" rel="noopener ugc nofollow"&gt;expectations&lt;/A&gt;)&lt;/LI&gt;&lt;LI&gt;Provides a declarative language for describing constraints (Python + JSON)&lt;/LI&gt;&lt;LI&gt;Provides&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" 
href="https://greatexpectations.io/expectations/" target="_blank" rel="noopener ugc nofollow"&gt;expectations gallery&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;with 300+ pre-defined assertions (50+&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" href="https://greatexpectations.io/expectations/?filterType=Package&amp;amp;viewType=Summary&amp;amp;showFilters=true&amp;amp;subFilterValues=core" target="_blank" rel="noopener ugc nofollow"&gt;core&lt;/A&gt;)&lt;/LI&gt;&lt;LI&gt;A long&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" href="https://docs.greatexpectations.io/docs/category/integrations" target="_blank" rel="noopener ugc nofollow"&gt;list of integrations&lt;/A&gt;, including data catalogs, data integration tools, data sources (files, in-memory, SQL databases), orchestrators, and notebooks&lt;/LI&gt;&lt;LI&gt;Runs data validation using&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" href="https://docs.greatexpectations.io/docs/terms/checkpoint" target="_blank" rel="noopener ugc nofollow"&gt;Checkpoints&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;Subject matter expert friendly for expectations definition using&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" href="https://docs.greatexpectations.io/docs/guides/expectations/data_assistants/how_to_create_an_expectation_suite_with_the_onboarding_data_assistant" target="_blank" rel="noopener ugc nofollow"&gt;data assistant&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;Automatically generates&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" href="https://docs.greatexpectations.io/docs/terms/data_docs" target="_blank" rel="noopener ugc nofollow"&gt;documentation&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;to display validation results (HTML)&lt;/LI&gt;&lt;LI&gt;No official docker image&lt;/LI&gt;&lt;LI&gt;&lt;A class="" href="https://greatexpectations.io/gx-cloud" target="_blank" rel="noopener ugc nofollow"&gt;Cloud version&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;available&lt;/LI&gt;&lt;LI&gt;Great&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A 
class="" href="https://greatexpectations.io/community" target="_blank" rel="noopener ugc nofollow"&gt;community&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;regarding contributions (&lt;A class="" href="https://github.com/great-expectations/great_expectations" target="_blank" rel="noopener ugc nofollow"&gt;GitHub&lt;/A&gt;), knowledge exchange and Q&amp;amp;A (&lt;A class="" href="https://greatexpectationstalk.slack.com/" target="_blank" rel="noopener ugc nofollow"&gt;Slack&lt;/A&gt;)&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Soda Core&lt;/STRONG&gt;&lt;/P&gt;&lt;UL class=""&gt;&lt;LI&gt;CLI tool and Python library&lt;/LI&gt;&lt;LI&gt;Allows you to define assertions about your data (named&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" href="https://docs.soda.io/soda-cl/metrics-and-checks.html" target="_blank" rel="noopener ugc nofollow"&gt;checks&lt;/A&gt;)&lt;/LI&gt;&lt;LI&gt;Provides a human-readable, domain-specific language for data reliability called Soda Checks Language (YAML)&lt;/LI&gt;&lt;LI&gt;Includes&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" href="https://docs.soda.io/soda-cl/metrics-and-checks.html#list-of-sodacl-metrics-and-checks" target="_blank" rel="noopener ugc nofollow"&gt;25+ built-in metrics&lt;/A&gt;, plus the ability to create user-defined checks (&lt;A class="" href="https://docs.soda.io/soda-cl/user-defined.html" target="_blank" rel="noopener ugc nofollow"&gt;SQL queries&lt;/A&gt;)&lt;/LI&gt;&lt;LI&gt;Compatible with&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" href="https://github.com/sodadata/soda-core/blob/main/docs/installation.md#compatibility" target="_blank" rel="noopener ugc nofollow"&gt;20+ data sources&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;(files, in-memory, SQL databases)&lt;/LI&gt;&lt;LI&gt;Runs data validation using&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" href="https://github.com/sodadata/soda-core/blob/main/docs/scan-core.md" target="_blank" rel="noopener ugc 
nofollow"&gt;scans&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;Display scan results in the&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" href="https://github.com/sodadata/soda-core/blob/main/docs/scan-core.md#scan-output" target="_blank" rel="noopener ugc nofollow"&gt;CLI&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;(save to file available) or access them&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" href="https://github.com/sodadata/soda-core/blob/main/docs/scan-core.md#programmatically-use-scan-output" target="_blank" rel="noopener ugc nofollow"&gt;programmatically&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;Collects usage statistics (you can&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A class="" href="https://github.com/sodadata/soda-core/blob/main/docs/usage-stats.md#opt-out-of-usage-statistics" target="_blank" rel="noopener ugc nofollow"&gt;opt out&lt;/A&gt;)&lt;/LI&gt;&lt;LI&gt;&lt;A class="" href="https://hub.docker.com/r/sodadata/soda-core" target="_blank" rel="noopener ugc nofollow"&gt;Docker image&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;available&lt;/LI&gt;&lt;LI&gt;&lt;A class="" href="https://docs.soda.io/soda-cloud/overview.html" target="_blank" rel="noopener ugc nofollow"&gt;Cloud version&lt;/A&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;available&lt;/LI&gt;&lt;LI&gt;Decent community regarding contributions (&lt;A class="" href="https://github.com/sodadata/soda-core" target="_blank" rel="noopener ugc nofollow"&gt;GitHub&lt;/A&gt;), knowledge exchange and Q&amp;amp;A (&lt;A class="" href="https://community.soda.io/slack" target="_blank" rel="noopener ugc nofollow"&gt;Slack&lt;/A&gt;)&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Sun, 05 Jan 2025 17:07:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-are-powerfull-data-quality-tools-libraries-to-build-data/m-p/104250#M41694</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-01-05T17:07:31Z</dc:date>
    </item>
    <item>
      <title>Re: What are powerful data quality tools/libraries to build a data quality framework in Databricks?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-are-powerfull-data-quality-tools-libraries-to-build-data/m-p/104418#M41738</link>
      <description>&lt;P&gt;A year ago we did a bake-off with Soda Core, Great Expectations, deequ and DLT Expectations. Hands down, you want to use DLT Expectations. It's built into DLT, works seamlessly in your pipelines, and can quarantine bad data and output statistics.&lt;/P&gt;&lt;P&gt;Because some of our data can be updated, not all of our pipelines can use DLT, so those pipelines can't use DLT Expectations. I recently did a small POC with Cuallee, &lt;A href="https://github.com/canimus/cuallee" target="_blank"&gt;https://github.com/canimus/cuallee&lt;/A&gt;. It worked nicely in Databricks and might make a good alternative in these cases.&lt;/P&gt;</description>
      <pubDate>Mon, 06 Jan 2025 18:21:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-are-powerfull-data-quality-tools-libraries-to-build-data/m-p/104418#M41738</guid>
      <dc:creator>Rjdudley</dc:creator>
      <dc:date>2025-01-06T18:21:34Z</dc:date>
    </item>
    <item>
      <title>Re: What are powerful data quality tools/libraries to build a data quality framework in Databricks?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-are-powerfull-data-quality-tools-libraries-to-build-data/m-p/105331#M42082</link>
      <description>&lt;P&gt;Thank you&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/107723"&gt;@Rjdudley&lt;/a&gt;&amp;nbsp;and&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;&amp;nbsp;for your valuable responses.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;What free or open-source libraries or tools are available for implementing a data quality framework in Databricks? Any short guidance on how to implement a data quality framework in Databricks?&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 12 Jan 2025 13:53:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-are-powerfull-data-quality-tools-libraries-to-build-data/m-p/105331#M42082</guid>
      <dc:creator>shubham_007</dc:creator>
      <dc:date>2025-01-12T13:53:41Z</dc:date>
    </item>
    <item>
      <title>Re: What are powerful data quality tools/libraries to build a data quality framework in Databricks?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-are-powerfull-data-quality-tools-libraries-to-build-data/m-p/105333#M42083</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/100776"&gt;@shubham_007&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;You can use the Great Expectations Python library in Databricks, where it can be configured to run on the Spark engine. Find out more at this link: &lt;A href="https://docs.greatexpectations.io/docs/core/introduction/" target="_blank"&gt;https://docs.greatexpectations.io/docs/core/introduction/&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regards,&lt;BR /&gt;Hari Prasad&lt;/P&gt;</description>
      <pubDate>Sun, 12 Jan 2025 18:08:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-are-powerfull-data-quality-tools-libraries-to-build-data/m-p/105333#M42083</guid>
      <dc:creator>hari-prasad</dc:creator>
      <dc:date>2025-01-12T18:08:53Z</dc:date>
    </item>
    <item>
      <title>Re: What are powerful data quality tools/libraries to build a data quality framework in Databricks?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-are-powerfull-data-quality-tools-libraries-to-build-data/m-p/105335#M42084</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;EM&gt;&lt;STRONG&gt;Any short guidance on how to implement a data quality framework in Databricks?&lt;/STRONG&gt;&lt;/EM&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;With dbdemos, you can learn a practical architecture for data quality testing using the expectations feature of DLT. I hope this helps! (Please note that some DLT syntax might be outdated in certain sections.)&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.databricks.com/resources/demos/tutorials/data-science-and-ai/unit-testing-delta-live-table-for-production-grade-pipelines" target="_blank"&gt;https://www.databricks.com/resources/demos/tutorials/data-science-and-ai/unit-testing-delta-live-table-for-production-grade-pipelines&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 12 Jan 2025 18:57:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-are-powerfull-data-quality-tools-libraries-to-build-data/m-p/105335#M42084</guid>
      <dc:creator>Takuya-Omi</dc:creator>
      <dc:date>2025-01-12T18:57:24Z</dc:date>
    </item>
    <item>
      <title>Re: What are powerful data quality tools/libraries to build a data quality framework in Databricks?</title>
      <link>https://community.databricks.com/t5/data-engineering/what-are-powerfull-data-quality-tools-libraries-to-build-data/m-p/136641#M50624</link>
      <description>&lt;P&gt;&lt;SPAN&gt;Consider our open-source data quality tool, DataOps Data Quality TestGen. Our goal is to help data teams automatically generate 80% of the data tests they need with just a few clicks, while offering a nice UI for collaborating on the remaining 20%, the tests unique to their organization. It learns your data and automatically applies over 60 different data quality tests.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;It’s licensed under Apache 2.0 and performs data profiling, data cataloging, hygiene reviews of new datasets, and quality dashboarding. We are a private, profitable company that developed this tool as part of our work with large and small customers. The open-source version is a full-featured solution, and the enterprise version is reasonably priced. &lt;/SPAN&gt;&lt;A href="https://info.datakitchen.io/install-dataops-data-quality-testgen-today" target="_blank"&gt;&lt;SPAN&gt;https://info.datakitchen.io/install-dataops-data-quality-testgen-today&lt;/SPAN&gt;&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Oct 2025 20:44:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/what-are-powerfull-data-quality-tools-libraries-to-build-data/m-p/136641#M50624</guid>
      <dc:creator>ChrisBergh-Data</dc:creator>
      <dc:date>2025-10-29T20:44:41Z</dc:date>
    </item>
  </channel>
</rss>

