<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Adding deduplication method to Spark Streaming in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/adding-deduplication-method-to-spark-streaming/m-p/21445#M14616</link>
    <description>&lt;P&gt;Hi everyone, I am having some trouble adding a deduplication step to a file stream that is already running. The code I am trying to add is this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df = df.withWatermark("arrival_time", "20 minutes")\
.dropDuplicates(["event_id", "arrival_time"])&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;However, I am getting the following error:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;Caused by: java.lang.IllegalStateException: Error reading streaming state file of HDFSStateStoreProvider[id = (op=0,part=101),dir = dbfs:/mnt/checkpoints/silver_events/state/0/101]: dbfs:/mnt/checkpoints/silver_events/state/0/101/1.delta does not exist. If the stream job is restarted with a new or updated state operation, please create a new checkpoint location or clear the existing checkpoint location.&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;My two questions are:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Why am I getting this error, and what does it mean?&lt;/LI&gt;&lt;LI&gt;Is it really possible to delete a stream's checkpoint and not get duplicated data when restarting the stream?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Thank you!&lt;/P&gt;</description>
    <pubDate>Wed, 04 May 2022 12:35:26 GMT</pubDate>
    <dc:creator>patojo94</dc:creator>
    <dc:date>2022-05-04T12:35:26Z</dc:date>
    <item>
      <title>Adding deduplication method to Spark Streaming</title>
      <link>https://community.databricks.com/t5/data-engineering/adding-deduplication-method-to-spark-streaming/m-p/21445#M14616</link>
      <description>&lt;P&gt;Hi everyone, I am having some trouble adding a deduplication step to a file stream that is already running. The code I am trying to add is this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df = df.withWatermark("arrival_time", "20 minutes")\
.dropDuplicates(["event_id", "arrival_time"])&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;However, I am getting the following error:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;Caused by: java.lang.IllegalStateException: Error reading streaming state file of HDFSStateStoreProvider[id = (op=0,part=101),dir = dbfs:/mnt/checkpoints/silver_events/state/0/101]: dbfs:/mnt/checkpoints/silver_events/state/0/101/1.delta does not exist. If the stream job is restarted with a new or updated state operation, please create a new checkpoint location or clear the existing checkpoint location.&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;My two questions are:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Why am I getting this error, and what does it mean?&lt;/LI&gt;&lt;LI&gt;Is it really possible to delete a stream's checkpoint and not get duplicated data when restarting the stream?&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Thank you!&lt;/P&gt;</description>
      <pubDate>Wed, 04 May 2022 12:35:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/adding-deduplication-method-to-spark-streaming/m-p/21445#M14616</guid>
      <dc:creator>patojo94</dc:creator>
      <dc:date>2022-05-04T12:35:26Z</dc:date>
    </item>
  </channel>
</rss>