<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: data quality check in data engineering in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/data-quality-check-in-data-engineering/m-p/57449#M7925</link>
    <description>&lt;P&gt;In Databricks, you can install external libraries by going to the Compute (Clusters) tab, selecting your cluster, opening its Libraries tab, and adding the Maven coordinates for Deequ. Choose the Deequ release that matches your cluster's Spark version (check Maven Central for the current list), for example:&lt;/P&gt;&lt;P&gt;com.amazon.deequ:deequ:2.0.7-spark-3.5&lt;/P&gt;&lt;P&gt;Outside Databricks, where you create the Spark session yourself, you can instead add Deequ as a dependency through the spark.jars.packages configuration option. It must be set before the session starts (for example with spark-submit --packages); calling spark.conf.set on a running session does not load the library.&lt;/P&gt;&lt;P&gt;Write your data quality checks using Deequ functions. For example:&lt;/P&gt;&lt;P&gt;import com.amazon.deequ.{VerificationSuite, VerificationResult}&lt;BR /&gt;import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}&lt;/P&gt;&lt;P&gt;val verificationResult: VerificationResult = VerificationSuite()&lt;BR /&gt;.onData(yourDataFrame)&lt;BR /&gt;.addCheck(&lt;BR /&gt;Check(CheckLevel.Error, "basic data quality checks")&lt;BR /&gt;.isComplete("yourColumn") // no nulls allowed; define your constraints here&lt;BR /&gt;)&lt;BR /&gt;.run()&lt;/P&gt;&lt;P&gt;// CheckStatus.Success means every constraint passed&lt;BR /&gt;val passed = verificationResult.status == CheckStatus.Success&lt;/P&gt;</description>
    <pubDate>Tue, 16 Jan 2024 13:09:13 GMT</pubDate>
    <dc:creator>CharlesReily</dc:creator>
    <dc:date>2024-01-16T13:09:13Z</dc:date>
    <item>
      <title>data quality check in data engineering</title>
      <link>https://community.databricks.com/t5/get-started-discussions/data-quality-check-in-data-engineering/m-p/46363#M7919</link>
      <description>&lt;P&gt;Can we use the Deequ library with Azure Databricks? If yes, please provide some supporting material or examples.&lt;/P&gt;&lt;P&gt;Is there a similar data quality library, or any suggestion for automating data quality checks during data engineering on Azure Databricks?&lt;/P&gt;&lt;P&gt;Thanks in advance,&lt;/P&gt;&lt;P&gt;Anantha&lt;/P&gt;</description>
      <pubDate>Wed, 27 Sep 2023 09:35:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/data-quality-check-in-data-engineering/m-p/46363#M7919</guid>
      <dc:creator>ankris</dc:creator>
      <dc:date>2023-09-27T09:35:36Z</dc:date>
    </item>
    <item>
      <title>Re: data quality check in data engineering</title>
      <link>https://community.databricks.com/t5/get-started-discussions/data-quality-check-in-data-engineering/m-p/46520#M7920</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/54486"&gt;@ankris&lt;/a&gt;&amp;nbsp;I've never used Deequ, so I can't comment on that library, but I've used Great Expectations and it works really well.&lt;/P&gt;&lt;P&gt;&lt;A href="https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/968100988546031/3944681434720936/8836542754149149/latest.html" target="_blank"&gt;https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/968100988546031/3944681434720936/8836542754149149/latest.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 28 Sep 2023 05:46:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/data-quality-check-in-data-engineering/m-p/46520#M7920</guid>
      <dc:creator>daniel_sahal</dc:creator>
      <dc:date>2023-09-28T05:46:41Z</dc:date>
    </item>
    <item>
      <title>Re: data quality check in data engineering</title>
      <link>https://community.databricks.com/t5/get-started-discussions/data-quality-check-in-data-engineering/m-p/46696#M7921</link>
      <description>&lt;P&gt;Thank you for the prompt reply. I will go through it.&lt;/P&gt;&lt;P&gt;In the meantime, could anyone provide information on using Deequ with Azure Databricks?&lt;/P&gt;&lt;P&gt;Also, is there any other library or package (equivalent to Deequ) that can be used in Azure Databricks?&lt;/P&gt;</description>
      <pubDate>Fri, 29 Sep 2023 05:43:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/data-quality-check-in-data-engineering/m-p/46696#M7921</guid>
      <dc:creator>ankris</dc:creator>
      <dc:date>2023-09-29T05:43:17Z</dc:date>
    </item>
    <item>
      <title>Re: data quality check in data engineering</title>
      <link>https://community.databricks.com/t5/get-started-discussions/data-quality-check-in-data-engineering/m-p/46722#M7922</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/54486"&gt;@ankris&lt;/a&gt;&amp;nbsp;Can you describe your data pipeline a bit? If you are writing it in Delta Live Tables (my recommendation), you can express data quality checks "in flight" as expectations while the pipeline processes data. You can do post-ETL data quality checks (e.g. at the aggregate level) with Lakehouse Monitoring.&lt;/P&gt;</description>
      <pubDate>Fri, 29 Sep 2023 09:03:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/data-quality-check-in-data-engineering/m-p/46722#M7922</guid>
      <dc:creator>BilalAslamDbrx</dc:creator>
      <dc:date>2023-09-29T09:03:34Z</dc:date>
    </item>
    <item>
      <title>Re: data quality check in data engineering</title>
      <link>https://community.databricks.com/t5/get-started-discussions/data-quality-check-in-data-engineering/m-p/49290#M7924</link>
      <description>&lt;P&gt;We have a data pipeline built with Delta tables and orchestrated using ADF. On top of it, we would like to build a data quality assessor using an external web interface (already built as a Streamlit app). We also plan to pass some variables from the external interface to evaluate data quality.&lt;/P&gt;</description>
      <pubDate>Mon, 16 Oct 2023 10:33:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/data-quality-check-in-data-engineering/m-p/49290#M7924</guid>
      <dc:creator>ankris</dc:creator>
      <dc:date>2023-10-16T10:33:25Z</dc:date>
    </item>
    <item>
      <title>Re: data quality check in data engineering</title>
      <link>https://community.databricks.com/t5/get-started-discussions/data-quality-check-in-data-engineering/m-p/57449#M7925</link>
      <description>&lt;P&gt;In Databricks, you can install external libraries by going to the Compute (Clusters) tab, selecting your cluster, opening its Libraries tab, and adding the Maven coordinates for Deequ. Choose the Deequ release that matches your cluster's Spark version (check Maven Central for the current list), for example:&lt;/P&gt;&lt;P&gt;com.amazon.deequ:deequ:2.0.7-spark-3.5&lt;/P&gt;&lt;P&gt;Outside Databricks, where you create the Spark session yourself, you can instead add Deequ as a dependency through the spark.jars.packages configuration option. It must be set before the session starts (for example with spark-submit --packages); calling spark.conf.set on a running session does not load the library.&lt;/P&gt;&lt;P&gt;Write your data quality checks using Deequ functions. For example:&lt;/P&gt;&lt;P&gt;import com.amazon.deequ.{VerificationSuite, VerificationResult}&lt;BR /&gt;import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}&lt;/P&gt;&lt;P&gt;val verificationResult: VerificationResult = VerificationSuite()&lt;BR /&gt;.onData(yourDataFrame)&lt;BR /&gt;.addCheck(&lt;BR /&gt;Check(CheckLevel.Error, "basic data quality checks")&lt;BR /&gt;.isComplete("yourColumn") // no nulls allowed; define your constraints here&lt;BR /&gt;)&lt;BR /&gt;.run()&lt;/P&gt;&lt;P&gt;// CheckStatus.Success means every constraint passed&lt;BR /&gt;val passed = verificationResult.status == CheckStatus.Success&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jan 2024 13:09:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/data-quality-check-in-data-engineering/m-p/57449#M7925</guid>
      <dc:creator>CharlesReily</dc:creator>
      <dc:date>2024-01-16T13:09:13Z</dc:date>
    </item>
    <item>
      <title>Re: data quality check in data engineering</title>
      <link>https://community.databricks.com/t5/get-started-discussions/data-quality-check-in-data-engineering/m-p/80591#M7926</link>
      <description>&lt;P&gt;Hi there!&lt;/P&gt;&lt;P&gt;You could also take a look at &lt;A href="https://rudol.ai" target="_self"&gt;Rudol&lt;/A&gt;. It offers no-code data quality validations, so non-technical roles such as Business Analysts or Data Stewards can configure quality checks themselves.&lt;/P&gt;</description>
      <pubDate>Thu, 25 Jul 2024 15:36:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/data-quality-check-in-data-engineering/m-p/80591#M7926</guid>
      <dc:creator>joarobles</dc:creator>
      <dc:date>2024-07-25T15:36:09Z</dc:date>
    </item>
  </channel>
</rss>

