cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Running large window spark structured streaming aggregations with small slide duration

serg-v
New Contributor III

I want to run aggregations on large windows (90 days) with small slide duration (5 minutes).

Straightforward solution leads to giant state around hundreds of gigabytes, which doesn't look acceptable.

Is there any best practices doing this?

Now I consider following scenarios:

  1. Use flatMapGroupsWithState and implement EWMA (exponentially weighted moving average) instead of average to reduce state. Is there good library for EWMA?
  2. Somehow join data from two streams - e.g. 90 day window with 1 day slide and 1 day window with 5 minute slide

Any other ideas?

Thread in azure q&a

3 REPLIES 3

HI @Sergey Volkov​,

Just a friendly follow-up. Are you still looking for help or did any of the docs that Kaniz has shared help you?

serg-v
New Contributor III

Hi.

> Are you still looking for help

No, thank you, we have implemented EWMA using flatMapGroupsWithState.

> did any of the docs that Kaniz has shared help you?

Not really. They are just slightly connected to my problem.

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.