cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

The State in-stream is growing too large in stream

User16826994223
Honored Contributor III

I have a customer with a streaming pipeline from Kafka to Delta. They are leveraging RocksDB, watermarking for 30min and attempting to dropDuplicates. They are seeing their state grow to 6.2 billion rows--- on a stream that hits at maximum 7000 rows per second during peak traffic . The state has grown larger than the potential total rows in the 30min window at peak times (12.6 million max in 30min window). Does anyone have any ideas or has seen this behavior before?

1 REPLY 1

shaines
New Contributor II

I've seen a similar issue with large state using flatMapGroupsWithState. It is possible that A.) they are not using the state.setTimeout correctly or B.) they are not calling state.remove() when the stored state has timed out, leaving the state to grow in size.

def groupedStateFunc(key: String, records: Iterator[A], state: GroupState[B]): Iterator[C] = {
  if (state.hasTimedOut) {
    val result = state.getOption match { ... }
    state.remove() // if this isn't called the state will leak
    result
  } else {
     val results = state.getOption match {
       case Some(lastState) => // merge new records and do something
       case None => // first state - create initial StateStore record
     }
    state.update(processingFunc(records))
    // if the timeout isn't set with an explicit point in the future (or if the value is too large) then records won't timeout and the state can grow
    state.setTimeoutTimestamp(timestampWhenStateWillExpire)
    Iterator.empty
  }
}

If this isn't the case, then it could also be an issue with the representation of time. The StateStore will track the watermark metrics for each microbatch using timestamps (epoch milliseconds) to see the acceptable boundaries (min/avg/max of watermark thresholds). It is possible that investigating here can help as well.

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.