I want to run aggregations on large windows (90 days) with small slide duration (5 minutes).
Straightforward solution leads to giant state around hundreds of gigabytes, which doesn't look acceptable.
Is there any best practices doing this?
Now I consider following scenarios:
Use flatMapGroupsWithState and implement EWMA (exponentially weighted moving average) instead of average to reduce state. Is there good library for EWMA?
Somehow join data from two streams - e.g. 90 day window with 1 day slide and 1 day window with 5 minute slide
Hi @Sergey Volkov, Thanks for your question. Here are some fantastic articles on EWMA and Event-time Aggregation in Apache Spark™’s Structured Streaming. Please have a look. Let us know if that helps.
Welcome to Databricks Community: Lets learn, network and celebrate together
Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections.