<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Deduplication with rocksdb, should old state files be deleted manually (to manage storage size)? in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/deduplication-with-rocksdb-should-old-state-files-be-deleted/m-p/112569#M9208</link>
    <description>&lt;P&gt;Hi, I have the following streaming setup:&lt;/P&gt;&lt;P&gt;I want to remove duplicates in a streaming job.&lt;/P&gt;&lt;P&gt;1) The deduplication strategy is defined by two fields: extraction_timestamp and hash (a row-wise hash).&lt;/P&gt;&lt;P&gt;2) The watermark strategy: extraction_timestamp with a "10 seconds" interval.&lt;/P&gt;&lt;P&gt;--&amp;gt; Duplicates are removed within the same extraction_timestamp.&lt;/P&gt;&lt;P&gt;RocksDB and state management work fine: the watermark is applied correctly and old active states are not checked. That is exactly what I wanted, because the extraction times of consecutive streaming batches are more than 10 seconds apart, so there is no need to check those states. The problem is that, because extraction_timestamp is part of the deduplication strategy, all states remain "active" and RocksDB does not clean up the old state files.&lt;/P&gt;&lt;P&gt;Is there any option other than "manually" deleting these state files to manage storage size (as I said, they are no longer used because of the watermark logic)? Why doesn't RocksDB use the watermark as part of its cleanup plan?&lt;/P&gt;</description>
    <pubDate>Fri, 14 Mar 2025 10:59:11 GMT</pubDate>
    <dc:creator>LasseL</dc:creator>
    <dc:date>2025-03-14T10:59:11Z</dc:date>
    <item>
      <title>Deduplication with rocksdb, should old state files be deleted manually (to manage storage size)?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/deduplication-with-rocksdb-should-old-state-files-be-deleted/m-p/112569#M9208</link>
      <description>&lt;P&gt;Hi, I have the following streaming setup:&lt;/P&gt;&lt;P&gt;I want to remove duplicates in a streaming job.&lt;/P&gt;&lt;P&gt;1) The deduplication strategy is defined by two fields: extraction_timestamp and hash (a row-wise hash).&lt;/P&gt;&lt;P&gt;2) The watermark strategy: extraction_timestamp with a "10 seconds" interval.&lt;/P&gt;&lt;P&gt;--&amp;gt; Duplicates are removed within the same extraction_timestamp.&lt;/P&gt;&lt;P&gt;RocksDB and state management work fine: the watermark is applied correctly and old active states are not checked. That is exactly what I wanted, because the extraction times of consecutive streaming batches are more than 10 seconds apart, so there is no need to check those states. The problem is that, because extraction_timestamp is part of the deduplication strategy, all states remain "active" and RocksDB does not clean up the old state files.&lt;/P&gt;&lt;P&gt;Is there any option other than "manually" deleting these state files to manage storage size (as I said, they are no longer used because of the watermark logic)? Why doesn't RocksDB use the watermark as part of its cleanup plan?&lt;/P&gt;</description>
      <pubDate>Fri, 14 Mar 2025 10:59:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/deduplication-with-rocksdb-should-old-state-files-be-deleted/m-p/112569#M9208</guid>
      <dc:creator>LasseL</dc:creator>
      <dc:date>2025-03-14T10:59:11Z</dc:date>
    </item>
    <item>
      <title>Re: Deduplication with rocksdb, should old state files be deleted manually (to manage storage size)?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/deduplication-with-rocksdb-should-old-state-files-be-deleted/m-p/112823#M9209</link>
      <description>&lt;P&gt;Found a solution: the two parameters described in this KB article: &lt;A href="https://kb.databricks.com/streaming/how-to-efficiently-manage-state-store-files-in-apache-spark-streaming-applications" target="_blank"&gt;https://kb.databricks.com/streaming/how-to-efficiently-manage-state-store-files-in-apache-spark-streaming-applications&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 17 Mar 2025 16:34:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/deduplication-with-rocksdb-should-old-state-files-be-deleted/m-p/112823#M9209</guid>
      <dc:creator>LasseL</dc:creator>
      <dc:date>2025-03-17T16:34:21Z</dc:date>
    </item>
  </channel>
</rss>

