<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Extreme RocksDB memory usage in Administration &amp; Architecture</title>
    <link>https://community.databricks.com/t5/administration-architecture/extreme-rocksdb-memory-usage/m-p/42972#M347</link>
    <description>&lt;P&gt;While migrating to a production workload, I switched some queries to use RocksDB. I am concerned about its memory usage, though.&lt;/P&gt;&lt;P&gt;Here is sample output from my streaming query:&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;  "stateOperators" : [ {
    "operatorName" : "dedupeWithinWatermark",
    "numRowsTotal" : 611788,
    "numRowsUpdated" : 610009,
    "allUpdatesTimeMs" : 7303,
    "numRowsRemoved" : 633148,
    "allRemovalsTimeMs" : 6082,
    "commitTimeMs" : 10363,
    "memoryUsedBytes" : 32142263729,
    "numRowsDroppedByWatermark" : 0,
    "numShufflePartitions" : 4,
    "numStateStoreInstances" : 4,
    "customMetrics" : {
      "numDroppedDuplicateRows" : 0,
      "rocksdbBytesCopied" : 61561365,
      "rocksdbCommitCheckpointLatency" : 198,
      "rocksdbCommitCompactLatency" : 0,
      "rocksdbCommitFileSyncLatencyMs" : 3856,
      "rocksdbCommitFlushLatency" : 6302,
      "rocksdbCommitPauseLatency" : 0,
      "rocksdbCommitWriteBatchLatency" : 0,
      "rocksdbFilesCopied" : 4,
      "rocksdbFilesReused" : 11,
      "rocksdbGetCount" : 1853166,
      "rocksdbGetLatency" : 5490,
      "rocksdbPinnedBlocksMemoryUsage" : 198117968,
      "rocksdbPutCount" : 1243157,
      "rocksdbPutLatency" : 2073,
      "rocksdbReadBlockCacheHitCount" : 1928135,
      "rocksdbReadBlockCacheMissCount" : 21340,
      "rocksdbSstFileSize" : 201411521,
      "rocksdbTotalBytesRead" : 10763516,
      "rocksdbTotalBytesReadByCompaction" : 0,
      "rocksdbTotalBytesReadThroughIterator" : 139455496,
      "rocksdbTotalBytesWritten" : 146504893,
      "rocksdbTotalBytesWrittenByCompaction" : 0,
      "rocksdbTotalBytesWrittenByFlush" : 61562581,
      "rocksdbTotalCompactionLatencyMs" : 0,
      "rocksdbTotalFlushLatencyMs" : 2969,
      "rocksdbWriterStallLatencyMs" : 0,
      "rocksdbZipFileBytesUncompressed" : 41221
    }
  } ]&lt;/LI-CODE&gt;&lt;P&gt;If I understand this correctly, 611788 keys are stored in the database. The key is defined as:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;.withWatermark('kafka_timestamp', '5 minutes')
[...]
.dropDuplicatesWithinWatermark(['brand', 'transaction_id', 'status'])&lt;/LI-CODE&gt;&lt;P&gt;where kafka_timestamp is of type Timestamp and the other key columns are all Strings of at most 16 characters.&lt;/P&gt;&lt;P&gt;It gets even worse after the query has been running for some time: over 40 GB is used for just 20k entries.&lt;/P&gt;&lt;P&gt;Am I reading this incorrectly? Can I control this somehow? Or is this expected behaviour? It is using 50 to 100x more memory than I would expect even in the most extreme scenario.&lt;/P&gt;&lt;P&gt;Any insight would be highly appreciated, thank you!&lt;/P&gt;</description>
    <pubDate>Thu, 31 Aug 2023 12:45:38 GMT</pubDate>
    <dc:creator>PetePP</dc:creator>
    <dc:date>2023-08-31T12:45:38Z</dc:date>
    <item>
      <title>Extreme RocksDB memory usage</title>
      <link>https://community.databricks.com/t5/administration-architecture/extreme-rocksdb-memory-usage/m-p/42972#M347</link>
      <description>&lt;P&gt;While migrating to a production workload, I switched some queries to use RocksDB. I am concerned about its memory usage, though.&lt;/P&gt;&lt;P&gt;Here is sample output from my streaming query:&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;  "stateOperators" : [ {
    "operatorName" : "dedupeWithinWatermark",
    "numRowsTotal" : 611788,
    "numRowsUpdated" : 610009,
    "allUpdatesTimeMs" : 7303,
    "numRowsRemoved" : 633148,
    "allRemovalsTimeMs" : 6082,
    "commitTimeMs" : 10363,
    "memoryUsedBytes" : 32142263729,
    "numRowsDroppedByWatermark" : 0,
    "numShufflePartitions" : 4,
    "numStateStoreInstances" : 4,
    "customMetrics" : {
      "numDroppedDuplicateRows" : 0,
      "rocksdbBytesCopied" : 61561365,
      "rocksdbCommitCheckpointLatency" : 198,
      "rocksdbCommitCompactLatency" : 0,
      "rocksdbCommitFileSyncLatencyMs" : 3856,
      "rocksdbCommitFlushLatency" : 6302,
      "rocksdbCommitPauseLatency" : 0,
      "rocksdbCommitWriteBatchLatency" : 0,
      "rocksdbFilesCopied" : 4,
      "rocksdbFilesReused" : 11,
      "rocksdbGetCount" : 1853166,
      "rocksdbGetLatency" : 5490,
      "rocksdbPinnedBlocksMemoryUsage" : 198117968,
      "rocksdbPutCount" : 1243157,
      "rocksdbPutLatency" : 2073,
      "rocksdbReadBlockCacheHitCount" : 1928135,
      "rocksdbReadBlockCacheMissCount" : 21340,
      "rocksdbSstFileSize" : 201411521,
      "rocksdbTotalBytesRead" : 10763516,
      "rocksdbTotalBytesReadByCompaction" : 0,
      "rocksdbTotalBytesReadThroughIterator" : 139455496,
      "rocksdbTotalBytesWritten" : 146504893,
      "rocksdbTotalBytesWrittenByCompaction" : 0,
      "rocksdbTotalBytesWrittenByFlush" : 61562581,
      "rocksdbTotalCompactionLatencyMs" : 0,
      "rocksdbTotalFlushLatencyMs" : 2969,
      "rocksdbWriterStallLatencyMs" : 0,
      "rocksdbZipFileBytesUncompressed" : 41221
    }
  } ]&lt;/LI-CODE&gt;&lt;P&gt;If I understand this correctly, 611788 keys are stored in the database. The key is defined as:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;.withWatermark('kafka_timestamp', '5 minutes')
[...]
.dropDuplicatesWithinWatermark(['brand', 'transaction_id', 'status'])&lt;/LI-CODE&gt;&lt;P&gt;where kafka_timestamp is of type Timestamp and the other key columns are all Strings of at most 16 characters.&lt;/P&gt;&lt;P&gt;It gets even worse after the query has been running for some time: over 40 GB is used for just 20k entries.&lt;/P&gt;&lt;P&gt;Am I reading this incorrectly? Can I control this somehow? Or is this expected behaviour? It is using 50 to 100x more memory than I would expect even in the most extreme scenario.&lt;/P&gt;&lt;P&gt;Any insight would be highly appreciated, thank you!&lt;/P&gt;</description>
      <pubDate>Thu, 31 Aug 2023 12:45:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/extreme-rocksdb-memory-usage/m-p/42972#M347</guid>
      <dc:creator>PetePP</dc:creator>
      <dc:date>2023-08-31T12:45:38Z</dc:date>
    </item>
    <item>
      <title>Re: Extreme RocksDB memory usage</title>
      <link>https://community.databricks.com/t5/administration-architecture/extreme-rocksdb-memory-usage/m-p/43028#M350</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/87821"&gt;@PetePP&lt;/a&gt;&lt;/P&gt;&lt;P&gt;memoryUsedBytes: 32142263729&lt;/P&gt;&lt;P&gt;About 32 GB is the current usage. This is the memory used to store the 611k records. It is the storage required for all the columns in the dataset, not just the 4 columns used for the watermark and the deduplication key.&lt;/P&gt;</description>
      <pubDate>Fri, 01 Sep 2023 04:59:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/extreme-rocksdb-memory-usage/m-p/43028#M350</guid>
      <dc:creator>Tharun-Kumar</dc:creator>
      <dc:date>2023-09-01T04:59:31Z</dc:date>
    </item>
    <item>
      <title>Re: Extreme RocksDB memory usage</title>
      <link>https://community.databricks.com/t5/administration-architecture/extreme-rocksdb-memory-usage/m-p/43059#M351</link>
      <description>&lt;P&gt;Thank you for the input. Is there a particular reason why deduplication within the watermark stores the whole row and not just the key needed for deduplication? The first record has to be written to the table anyway, and its content is irrelevant, since the operator only drops later records that match the key.&lt;/P&gt;&lt;P&gt;Is there any way to control this behavior? I know I could enforce a constraint on write, but that seems excessive: the table has millions of rows, and I only need to look back a few minutes.&lt;/P&gt;</description>
      <pubDate>Fri, 01 Sep 2023 09:47:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/administration-architecture/extreme-rocksdb-memory-usage/m-p/43059#M351</guid>
      <dc:creator>PetePP</dc:creator>
      <dc:date>2023-09-01T09:47:46Z</dc:date>
    </item>
  </channel>
</rss>

