While migrating to a production workload, I switched some of my Structured Streaming queries to the RocksDB state store. I am concerned about its memory usage, though.
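For context, this is roughly how I enable the RocksDB state store (a minimal sketch; the app name and any other session options are simplified placeholders):

from pyspark.sql import SparkSession

# Minimal sketch of how the RocksDB state store provider is enabled.
# App name and other session options are placeholders.
spark = (
    SparkSession.builder
    .appName("dedupe-stream")
    .config(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    )
    .getOrCreate()
)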
Here is a sample of the progress output from my streaming query:
"stateOperators" : [ {
"operatorName" : "dedupeWithinWatermark",
"numRowsTotal" : 611788,
"numRowsUpdated" : 610009,
"allUpdatesTimeMs" : 7303,
"numRowsRemoved" : 633148,
"allRemovalsTimeMs" : 6082,
"commitTimeMs" : 10363,
"memoryUsedBytes" : 32142263729,
"numRowsDroppedByWatermark" : 0,
"numShufflePartitions" : 4,
"numStateStoreInstances" : 4,
"customMetrics" : {
"numDroppedDuplicateRows" : 0,
"rocksdbBytesCopied" : 61561365,
"rocksdbCommitCheckpointLatency" : 198,
"rocksdbCommitCompactLatency" : 0,
"rocksdbCommitFileSyncLatencyMs" : 3856,
"rocksdbCommitFlushLatency" : 6302,
"rocksdbCommitPauseLatency" : 0,
"rocksdbCommitWriteBatchLatency" : 0,
"rocksdbFilesCopied" : 4,
"rocksdbFilesReused" : 11,
"rocksdbGetCount" : 1853166,
"rocksdbGetLatency" : 5490,
"rocksdbPinnedBlocksMemoryUsage" : 198117968,
"rocksdbPutCount" : 1243157,
"rocksdbPutLatency" : 2073,
"rocksdbReadBlockCacheHitCount" : 1928135,
"rocksdbReadBlockCacheMissCount" : 21340,
"rocksdbSstFileSize" : 201411521,
"rocksdbTotalBytesRead" : 10763516,
"rocksdbTotalBytesReadByCompaction" : 0,
"rocksdbTotalBytesReadThroughIterator" : 139455496,
"rocksdbTotalBytesWritten" : 146504893,
"rocksdbTotalBytesWrittenByCompaction" : 0,
"rocksdbTotalBytesWrittenByFlush" : 61562581,
"rocksdbTotalCompactionLatencyMs" : 0,
"rocksdbTotalFlushLatencyMs" : 2969,
"rocksdbWriterStallLatencyMs" : 0,
"rocksdbZipFileBytesUncompressed" : 41221
}
} ]
If I understand this correctly, numRowsTotal means 611788 keys are stored in the state store, yet memoryUsedBytes reports ~32 GB while rocksdbSstFileSize is only ~201 MB. The deduplication key is defined as:
.withWatermark('kafka_timestamp', '5 minutes')
[...]
.dropDuplicatesWithinWatermark(['brand', 'transaction_id', 'status'])
where kafka_timestamp is of type Timestamp and the other key columns are all strings of at most 16 characters.
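Here is my back-of-the-envelope check of expected versus reported size (the per-row overhead factor is my own rough guess):

# Rough estimate of expected state size vs. what memoryUsedBytes reports.
# Assumptions (mine): 3 string keys of <= 16 chars, an 8-byte timestamp,
# and a generous 10x factor for encoding/RocksDB overhead.
num_keys = 611_788
raw_bytes_per_row = 3 * 16 + 8                   # 56 bytes of actual key data
generous_bytes_per_row = raw_bytes_per_row * 10  # ~560 bytes with overhead

expected = num_keys * generous_bytes_per_row     # ~343 MB
reported = 32_142_263_729                        # memoryUsedBytes, ~32 GB

print(f"expected ~{expected / 1e6:.0f} MB, "
      f"reported ~{reported / 1e9:.1f} GB, "
      f"ratio ~{reported / expected:.0f}x, "
      f"~{reported / num_keys / 1024:.0f} KiB per key")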
It gets even worse after the query has been running for some time: over 40 GB is reported for just 20k entries.
Am I reading this incorrectly? Can I control this somehow? Or is this expected behaviour? The query is using 50 to 100x more memory than I would expect in the most extreme scenario.
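For reference, the only memory-related knobs I have come across are the bounded-memory settings below; I am assuming they are the relevant ones for my case (Spark 3.5+), and the values are placeholders rather than a recommendation:

# Assumption on my part: these bounded-memory settings are the right knobs
# here; the values are placeholders, not a tuned recommendation.
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.boundedMemoryUsage", "true")
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.maxMemoryUsageMB", "2048")
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.writeBufferCacheRatio", "0.5")
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.highPriorityPoolRatio", "0.1")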
Any insight would be highly appreciated, thank you!