Running large window spark structured streaming aggregations with small slide duration
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
06-17-2022 02:25 AM
I want to run aggregations on large windows (90 days) with small slide duration (5 minutes).
Straightforward solution leads to giant state around hundreds of gigabytes, which doesn't look acceptable.
Is there any best practices doing this?
Now I consider following scenarios:
- Use flatMapGroupsWithState and implement EWMA (exponentially weighted moving average) instead of average to reduce state. Is there good library for EWMA?
- Somehow join data from two streams - e.g. 90 day window with 1 day slide and 1 day window with 5 minute slide
Any other ideas?
- Labels:
-
Spark structured streaming
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-29-2022 01:13 PM
HI @Sergey Volkov,
Just a friendly follow-up. Are you still looking for help or did any of the docs that Kaniz has shared help you?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
08-21-2022 02:51 AM
Hi.
> Are you still looking for help
No, thank you, we have implemented EWMA using flatMapGroupsWithState.
> did any of the docs that Kaniz has shared help you?
Not really. They are just slightly connected to my problem.

