Hi, I have the following streaming setup, and I want to remove duplicates in the stream:
1) The deduplication key is defined by two fields: extraction_timestamp and hash (a row-wise hash).
2) The watermark strategy: extraction_timestamp with a "10 seconds" interval
--> duplicates are removed within the extraction_timestamp watermark window.
RocksDB and state management work fine: the watermark is used correctly, and old active states are not needed or checked. That is exactly what I wanted, because I know those states never need checking (the extraction time between streaming batches is more than 10 seconds). The problem is that because extraction_timestamp is part of the deduplication key, all states stay "active" and RocksDB never cleans up the old state files.
Is there any option to "manually" delete these state files to keep the storage size under control (as I said, they are never used again because of the watermark logic)? And why does RocksDB not use the watermark as part of its cleanup plan?
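For context, a minimal PySpark sketch of the setup described above. The column names extraction_timestamp and hash come from the post; the rate source and console sink are stand-ins for the real pipeline:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in source: the rate source emits (timestamp, value); the real job
# would already carry extraction_timestamp and hash columns from its input.
events = (
    spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    .selectExpr(
        "timestamp AS extraction_timestamp",         # stand-in event-time column
        "sha2(CAST(value AS STRING), 256) AS hash",  # stand-in row-wise hash
    )
)

deduped = (
    events
    # 10-second watermark on extraction_timestamp, as described in the post.
    .withWatermark("extraction_timestamp", "10 seconds")
    # Both fields form the dedup key, so every new extraction_timestamp value
    # creates a fresh state entry; the watermark bounds which entries can
    # still match incoming rows.
    .dropDuplicates(["extraction_timestamp", "hash"])
)

query = (
    deduped.writeStream
    .format("console")     # stand-in sink
    .outputMode("append")
    .start()
)
```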
Accepted Solutions
Found the solution: https://kb.databricks.com/streaming/how-to-efficiently-manage-state-store-files-in-apache-spark-stre... <-- these two parameters.
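Since the link is truncated, the exact parameters are not visible in the thread. As an assumption, two real Spark SQL settings that govern state-file retention and cleanup frequency, and which such an article plausibly covers, are spark.sql.streaming.minBatchesToRetain and spark.sql.streaming.stateStore.maintenanceInterval. A hedged sketch:

```python
# ASSUMPTION: these two settings are a guess at what the (truncated) KB
# article refers to; verify against the article itself before relying on them.

# How many recent micro-batch versions of state files to keep for recovery
# (default 100). Lowering it lets old state-file versions be deleted sooner,
# at the cost of a shorter recovery window.
spark.conf.set("spark.sql.streaming.minBatchesToRetain", "10")

# How often the state store's background maintenance task runs, which is when
# old versions get cleaned up (default 60s).
spark.conf.set("spark.sql.streaming.stateStore.maintenanceInterval", "60s")
```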

