Hi, I have the following streaming setup, and I want to remove duplicates in the stream:
1) The deduplication key is defined by two fields: extraction_timestamp and hash (a row-wise hash).
2) The watermark strategy: extraction_timestamp with a "10 seconds" interval
--> duplicates are removed within the extraction_timestamp watermark window.
RocksDB and state management work fine: the watermark is used correctly, and old active states are not needed or checked. That is exactly what I wanted, because I know those states never need checking (the extraction time between streaming batches is more than 10 seconds). The problem is that because extraction_timestamp is part of the deduplication key, all states stay "active" and RocksDB never cleans up the old state files.
Is there any option to "manually" delete these state files to keep the storage size under control (as I said, they are never used again because of the watermark logic)? And why does RocksDB not use the watermark as part of its cleanup plan?
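For context, a minimal PySpark sketch of the setup described above. The column names extraction_timestamp and hash come from the post; the rate source and console sink are stand-ins for the real pipeline:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in source: the rate source emits (timestamp, value); the real job
# would already carry extraction_timestamp and hash columns from its input.
events = (
    spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    .selectExpr(
        "timestamp AS extraction_timestamp",         # stand-in event-time column
        "sha2(CAST(value AS STRING), 256) AS hash",  # stand-in row-wise hash
    )
)

deduped = (
    events
    # 10-second watermark on extraction_timestamp, as described in the post.
    .withWatermark("extraction_timestamp", "10 seconds")
    # Both fields form the dedup key, so every new extraction_timestamp value
    # creates a fresh state entry; the watermark bounds which entries can
    # still match incoming rows.
    .dropDuplicates(["extraction_timestamp", "hash"])
)

query = (
    deduped.writeStream
    .format("console")     # stand-in sink
    .outputMode("append")
    .start()
)
```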
Accepted Solutions
Found the solution: https://kb.databricks.com/streaming/how-to-efficiently-manage-state-store-files-in-apache-spark-stre... <-- these two parameters.
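Since the link is truncated, the exact parameters are not visible in the thread. As an assumption, two real Spark SQL settings that govern state-file retention and cleanup frequency, and which such an article plausibly covers, are spark.sql.streaming.minBatchesToRetain and spark.sql.streaming.stateStore.maintenanceInterval. A hedged sketch:

```python
# ASSUMPTION: these two settings are a guess at what the (truncated) KB
# article refers to; verify against the article itself before relying on them.

# How many recent micro-batch versions of state files to keep for recovery
# (default 100). Lowering it lets old state-file versions be deleted sooner,
# at the cost of a shorter recovery window.
spark.conf.set("spark.sql.streaming.minBatchesToRetain", "10")

# How often the state store's background maintenance task runs, which is when
# old versions get cleaned up (default 60s).
spark.conf.set("spark.sql.streaming.stateStore.maintenanceInterval", "60s")
```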

