<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Job runs indefinitely after integrating with PyDeequ in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/job-runs-indefinitely-after-integrating-with-pydeequ/m-p/11657#M6601</link>
    <description>&lt;P&gt;I don't think Databricks supports PyDeequ.&lt;/P&gt;</description>
    <pubDate>Tue, 17 Jan 2023 08:56:50 GMT</pubDate>
    <dc:creator>Hubert-Dudek</dc:creator>
    <dc:date>2023-01-17T08:56:50Z</dc:date>
    <item>
      <title>Job runs indefinitely after integrating with PyDeequ</title>
      <link>https://community.databricks.com/t5/data-engineering/job-runs-indefinitely-after-integrating-with-pydeequ/m-p/11656#M6600</link>
      <description>&lt;P&gt;I'm using PyDeequ data quality checks in one of our jobs.&lt;/P&gt;&lt;P&gt;After adding these checks, I noticed that the job never completes: it keeps running indefinitely even after the PyDeequ checks have finished and returned their results.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;As recommended in the &lt;A href="https://github.com/awslabs/python-deequ#wrapping-up" alt="https://github.com/awslabs/python-deequ#wrapping-up" target="_blank"&gt;PyDeequ documentation&lt;/A&gt;, I've added the calls below at the end of the job, after all processing is done.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.sparkContext._gateway.shutdown_callback_server()
spark.stop()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;However, the job continues to run and eventually has to be cancelled.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Has anyone else faced this while integrating PyDeequ on Databricks?&lt;/P&gt;&lt;P&gt;I would appreciate any pointers.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Tue, 17 Jan 2023 07:26:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/job-runs-indefinitely-after-integrating-with-pydeequ/m-p/11656#M6600</guid>
      <dc:creator>JD410993</dc:creator>
      <dc:date>2023-01-17T07:26:46Z</dc:date>
    </item>
    <item>
      <title>Re: Job runs indefinitely after integrating with PyDeequ</title>
      <link>https://community.databricks.com/t5/data-engineering/job-runs-indefinitely-after-integrating-with-pydeequ/m-p/11657#M6601</link>
      <description>&lt;P&gt;I don't think Databricks supports PyDeequ.&lt;/P&gt;</description>
      <pubDate>Tue, 17 Jan 2023 08:56:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/job-runs-indefinitely-after-integrating-with-pydeequ/m-p/11657#M6601</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2023-01-17T08:56:50Z</dc:date>
    </item>
    <item>
      <title>Re: Job runs indefinitely after integrating with PyDeequ</title>
      <link>https://community.databricks.com/t5/data-engineering/job-runs-indefinitely-after-integrating-with-pydeequ/m-p/11658#M6602</link>
      <description>&lt;P&gt;Hm, Deequ certainly works; I have read about multiple people using it.&lt;/P&gt;&lt;P&gt;Databricks is also mentioned in some of the open and closed issues on the PyDeequ GitHub page, so it might be possible after all.&lt;/P&gt;&lt;P&gt;But I think you need to check your Spark version, among other things, as there is an open issue regarding recent Spark versions (https://github.com/awslabs/python-deequ/issues/93).&lt;/P&gt;</description>
      <pubDate>Tue, 17 Jan 2023 10:26:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/job-runs-indefinitely-after-integrating-with-pydeequ/m-p/11658#M6602</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-01-17T10:26:58Z</dc:date>
    </item>
    <item>
      <title>Re: Job runs indefinitely after integrating with PyDeequ</title>
      <link>https://community.databricks.com/t5/data-engineering/job-runs-indefinitely-after-integrating-with-pydeequ/m-p/11659#M6603</link>
      <description>&lt;P&gt;To add to this:&lt;/P&gt;&lt;P&gt;Do not create your own SparkSession or stop it. Databricks manages the SparkSession for you.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Here are the announcements from the PyDeequ page:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;With PyDeequ v0.1.8+, we now officially support Spark3! Just make sure you have an environment variable SPARK_VERSION to specify your Spark version!&lt;/LI&gt;&lt;LI&gt;We've released a blog post on integrating PyDeequ onto AWS leveraging services such as AWS Glue, Athena, and SageMaker! Check it out: &lt;A href="https://aws.amazon.com/blogs/big-data/monitor-data-quality-in-your-data-lake-using-pydeequ-and-aws-glue/" alt="https://aws.amazon.com/blogs/big-data/monitor-data-quality-in-your-data-lake-using-pydeequ-and-aws-glue/" target="_blank"&gt;Monitor data quality in your data lake using PyDeequ and AWS Glue&lt;/A&gt;.&lt;/LI&gt;&lt;LI&gt;Check out the &lt;A href="https://aws.amazon.com/blogs/big-data/testing-data-quality-at-scale-with-pydeequ/" alt="https://aws.amazon.com/blogs/big-data/testing-data-quality-at-scale-with-pydeequ/" target="_blank"&gt;PyDeequ Release Announcement Blogpost&lt;/A&gt; with a tutorial walkthrough of the Amazon Reviews dataset!&lt;/LI&gt;&lt;LI&gt;Join the PyDeequ community on &lt;A href="https://join.slack.com/t/pydeequ/shared_invite/zt-te6bntpu-yaqPy7bhiN8Lu0NxpZs47Q" alt="https://join.slack.com/t/pydeequ/shared_invite/zt-te6bntpu-yaqPy7bhiN8Lu0NxpZs47Q" target="_blank"&gt;PyDeequ Slack&lt;/A&gt; to chat with the devs!&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 17 Jan 2023 10:30:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/job-runs-indefinitely-after-integrating-with-pydeequ/m-p/11659#M6603</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-01-17T10:30:20Z</dc:date>
    </item>
  </channel>
</rss>

