<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic The State in-stream is growing too large in stream in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/the-state-in-stream-is-growing-too-large-in-stream/m-p/17886#M11809</link>
    <description>&lt;P&gt;I have a customer with a streaming pipeline from Kafka to Delta. They are leveraging RocksDB, watermarking for 30min and attempting to dropDuplicates. They are seeing their state grow to 6.2 billion rows--- on a stream that hits at maximum 7000 rows per second during peak traffic . The state has grown larger than the potential total rows in the 30min window at peak times (12.6 million max in 30min window). Does anyone have any ideas or has seen this behavior before?&lt;/P&gt;</description>
    <pubDate>Mon, 28 Jun 2021 08:21:33 GMT</pubDate>
    <dc:creator>User16826994223</dc:creator>
    <dc:date>2021-06-28T08:21:33Z</dc:date>
    <item>
      <title>The State in-stream is growing too large in stream</title>
      <link>https://community.databricks.com/t5/data-engineering/the-state-in-stream-is-growing-too-large-in-stream/m-p/17886#M11809</link>
      <description>&lt;P&gt;I have a customer with a streaming pipeline from Kafka to Delta. They are leveraging RocksDB, watermarking for 30min and attempting to dropDuplicates. They are seeing their state grow to 6.2 billion rows--- on a stream that hits at maximum 7000 rows per second during peak traffic . The state has grown larger than the potential total rows in the 30min window at peak times (12.6 million max in 30min window). Does anyone have any ideas or has seen this behavior before?&lt;/P&gt;</description>
      <pubDate>Mon, 28 Jun 2021 08:21:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/the-state-in-stream-is-growing-too-large-in-stream/m-p/17886#M11809</guid>
      <dc:creator>User16826994223</dc:creator>
      <dc:date>2021-06-28T08:21:33Z</dc:date>
    </item>
    <item>
      <title>Re: The State in-stream is growing too large in stream</title>
      <link>https://community.databricks.com/t5/data-engineering/the-state-in-stream-is-growing-too-large-in-stream/m-p/17887#M11810</link>
      <description>&lt;P&gt;I've seen a similar issue with large state using flatMapGroupsWithState. It is possible that A.) they are not using the state.setTimeout correctly or B.) they are not calling state.remove() when the stored state has timed out, leaving the state to grow in size.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;def groupedStateFunc(key: String, records: Iterator[A], state: GroupState[B]): Iterator[C] = {
  if (state.hasTimedOut) {
    val result = state.getOption match { ... }
    state.remove() // if this isn't called the state will leak
    result
  } else {
     val results = state.getOption match {
       case Some(lastState) =&amp;gt; // merge new records and do something
       case None =&amp;gt; // first state - create initial StateStore record
     }
    state.update(processingFunc(records))
    // if the timeout isn't set with an explicit point in the future (or if the value is too large) then records won't timeout and the state can grow
    state.setTimeoutTimestamp(timestampWhenStateWillExpire)
    Iterator.empty
  }
}&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;If this isn't the case, then it could also be an issue with the representation of time. The StateStore will track the watermark metrics for each microbatch using timestamps (epoch milliseconds) to see the acceptable boundaries (min/avg/max of watermark thresholds). It is possible that investigating here can help as well. &lt;/P&gt;</description>
      <pubDate>Wed, 25 Aug 2021 16:36:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/the-state-in-stream-is-growing-too-large-in-stream/m-p/17887#M11810</guid>
      <dc:creator>shaines</dc:creator>
      <dc:date>2021-08-25T16:36:00Z</dc:date>
    </item>
  </channel>
</rss>

