I have a streaming aggregation query with highly variable micro-batch processing times. Seeing a lot of GC pauses in the logs. Any pointers on how to debug?

sajith_appukutt — Wed, 09 Jun 2021 08:20:06 GMT

Though the data volume is relatively even, the streaming aggregation query is showing highly variable micro-batch processing times.

Re: I have a streaming aggregation query with highly variable micro-batch processing times. Seeing a lot of GC pauses in the logs. Any pointers on how to debug?

sajith_appukutt — Thu, 17 Jun 2021 23:14:58 GMT

By default, the state data for a stateful streaming query (such as a streaming aggregation) is maintained in the JVM memory of the executors, and a large number of state objects can put memory pressure on the JVM, causing long GC pauses. If you have stateful operations in your streaming query, it is recommended to use the more optimized state management solution based on RocksDB.
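As a rough sketch of what switching to RocksDB looks like in practice: the state store provider is selected through the `spark.sql.streaming.stateStore.providerClass` setting, which has to be set before the query starts. The helper names below (`rocksdb_state_store_conf`, `apply_conf`) are made up for illustration, and `spark` is assumed to be an existing SparkSession. The first provider class is the Databricks Runtime one; the second is the one that ships with open-source Spark 3.2 and later.

```python
from typing import Dict

# Spark setting that selects the state store implementation.
STATE_STORE_PROVIDER_KEY = "spark.sql.streaming.stateStore.providerClass"

# RocksDB provider class on Databricks Runtime.
DATABRICKS_ROCKSDB = "com.databricks.sql.streaming.state.RocksDBStateStoreProvider"
# RocksDB provider class in open-source Apache Spark 3.2+.
OSS_SPARK_ROCKSDB = (
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider"
)


def rocksdb_state_store_conf(on_databricks: bool = True) -> Dict[str, str]:
    """Hypothetical helper: build the conf that enables RocksDB state."""
    provider = DATABRICKS_ROCKSDB if on_databricks else OSS_SPARK_ROCKSDB
    return {STATE_STORE_PROVIDER_KEY: provider}


def apply_conf(spark, conf: Dict[str, str]) -> None:
    """Hypothetical helper: apply each setting via spark.conf.set."""
    for key, value in conf.items():
        spark.conf.set(key, value)
```

Usage would be `apply_conf(spark, rocksdb_state_store_conf())` before calling `writeStream...start()`. As far as I know, the state store provider cannot be changed for a query that already has a checkpoint, so expect to restart the query with a new checkpoint location when switching.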

More details at https://docs.databricks.com/spark/latest/structured-streaming/production.html#optimize-performance-of-stateful-streaming-queries
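On the "how to debug" part, one way to check whether state size is what drives the slow batches is to look at the streaming query progress, which reports per-batch durations and state operator metrics. The sketch below assumes each progress entry is dict-like (as returned by `query.lastProgress` / `query.recentProgress` in PySpark) and carries the `durationMs` and `stateOperators` fields; `summarize_progress` is a hypothetical helper name, not part of any Spark API.

```python
from typing import Any, Dict, List


def summarize_progress(progress: Dict[str, Any]) -> Dict[str, Any]:
    """Pull out batch duration and total state size from one progress entry."""
    state_ops: List[Dict[str, Any]] = progress.get("stateOperators", [])
    return {
        "batchId": progress.get("batchId"),
        # Total wall-clock time spent on this micro-batch, in milliseconds.
        "triggerExecutionMs": progress.get("durationMs", {}).get("triggerExecution"),
        # Number of state rows held across all stateful operators.
        "stateRowsTotal": sum(op.get("numRowsTotal", 0) for op in state_ops),
        # Memory reported as used by the state stores, in bytes.
        "stateMemoryBytes": sum(op.get("memoryUsedBytes", 0) for op in state_ops),
    }
```

For example, `for p in query.recentProgress: print(summarize_progress(p))` gives a quick per-batch view. If `triggerExecutionMs` spikes while `stateRowsTotal` keeps growing, that points toward state size (and the resulting GC pressure) rather than input volume.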
