Deduplication with RocksDB: should old state files be deleted manually (to manage storage size)?
Hi, I have the following streaming setup:
I want to remove duplicates in the stream.
1) The deduplication key is defined by two fields: extraction_timestamp and hash (a row-wise hash).
2) The watermark is set on extraction_timestamp with a "10 seconds" delay threshold.
--> Duplicates are therefore removed within the extraction_timestamp watermark window (sketched below).
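For reference, a minimal sketch of that setup (the source format, paths and the way the hash is derived are illustrative, not my exact code):

```python
# Minimal sketch of the dedup setup; source format, paths and payload columns are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

# Streaming source that already carries extraction_timestamp.
raw = (
    spark.readStream
    .format("delta")        # illustrative source format
    .load("/data/events")   # illustrative path
)

# Row-wise hash over the payload columns (everything except the event-time column).
payload_cols = [c for c in raw.columns if c != "extraction_timestamp"]
hashed = raw.withColumn(
    "hash",
    F.sha2(F.concat_ws("||", *[F.col(c).cast("string") for c in payload_cols]), 256),
)

# Watermark on extraction_timestamp, dedup key = (extraction_timestamp, hash).
deduped = (
    hashed
    .withWatermark("extraction_timestamp", "10 seconds")
    .dropDuplicates(["extraction_timestamp", "hash"])
)
```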
Now RocksDB and state management work fine in one sense: the watermark is applied correctly and old state entries are not checked against new batches, which is exactly what I wanted, since I know the extraction times between streaming batches are more than 10 seconds apart and those old entries never need to be consulted. The problem is that, because extraction_timestamp is part of the deduplication key, all state entries stay "active" and RocksDB does not clean up the old state files.
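For context, the query is configured roughly like this (sink format, checkpoint and output paths are illustrative), with RocksDB enabled as the state store provider:

```python
# Rough illustration of the query configuration; it continues the `deduped` stream
# from the sketch above. Sink format and paths are illustrative.
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
)

query = (
    deduped.writeStream
    .format("delta")                              # illustrative sink
    .option("checkpointLocation", "/chk/dedup")   # illustrative checkpoint path
    .outputMode("append")
    .start("/data/events_deduped")                # illustrative output path
)
```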
I wonder whether there is any option to delete these state files "manually" in order to manage storage size (as I said, they are never consulted again because of the watermark logic)? And why doesn't RocksDB use the watermark as part of its cleanup plan?

