Running large window spark structured streaming ag... - Databricks Community - 17437


I want to run aggregations over large windows (90 days) with a small slide duration (5 minutes).

The straightforward solution leads to a giant state, on the order of hundreds of gigabytes, which doesn't look acceptable.
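To see where the state comes from, the arithmetic can be sketched as follows. This is only a back-of-the-envelope illustration; the variable names are made up for the example. With sliding windows, every event belongs to each window that covers its timestamp, so each active grouping key keeps roughly that many open window aggregates.

```python
from datetime import timedelta

window = timedelta(days=90)
slide = timedelta(minutes=5)

# Each event falls into every sliding window that covers its timestamp,
# so this is also roughly the number of open windows per active key.
windows_per_event = window // slide
print(windows_per_event)  # 25920
```

Multiplied by the number of distinct keys (and by the size of each aggregate), this quickly reaches the state sizes described above.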

Are there any best practices for doing this?

I am currently considering the following options:

Use flatMapGroupsWithState and implement an EWMA (exponentially weighted moving average) instead of a plain average to reduce state. Is there a good library for EWMA?
Somehow join data from two streams, e.g. a 90-day window with a 1-day slide and a 1-day window with a 5-minute slide.
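The second option relies on the average being decomposable into mergeable partial aggregates (sum and count). A minimal plain-Python sketch of that idea, outside Spark, is below. `Partial` and `combine` are hypothetical names, and the numbers are invented. Note that this is an approximation: the old edge of the window moves in whole-day steps, so the covered span varies between roughly 89 and 90 days.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Partial:
    """Mergeable partial aggregate for an average: running sum and row count."""
    total: float = 0.0
    count: int = 0

    def merge(self, other: "Partial") -> "Partial":
        return Partial(self.total + other.total, self.count + other.count)

    def mean(self) -> Optional[float]:
        return self.total / self.count if self.count else None


def combine(daily: Iterable[Partial], intraday: Iterable[Partial]) -> Partial:
    """Merge closed daily partials with the current day's 5-minute partials."""
    result = Partial()
    for partial in list(daily) + list(intraday):
        result = result.merge(partial)
    return result


# Invented numbers: 89 closed days plus three 5-minute buckets from today.
daily = [Partial(total=100.0, count=10)] * 89
intraday = [Partial(4.0, 1), Partial(6.0, 1), Partial(10.0, 2)]
approx = combine(daily, intraday)
print(approx.count, approx.mean())
```

Per key this keeps about 90 daily partials plus at most 288 five-minute partials, instead of one aggregate per open 90-day window.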

Any other ideas?

Thread in Azure Q&A

2 REPLIES

Hi @Sergey Volkov,

Just a friendly follow-up. Are you still looking for help or did any of the docs that Kaniz has shared help you?

Hi.

> Are you still looking for help

No, thank you, we have implemented EWMA using flatMapGroupsWithState.

> did any of the docs that Kaniz has shared help you?

Not really. They are only loosely related to my problem.
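The EWMA implementation mentioned in this reply was not posted in the thread. For illustration only, a minimal sketch of a time-decayed EWMA update with constant-size state per key might look like the following. Everything here is an assumption rather than the poster's code: the names `EwmaState` and `update`, the half-life parameterization, and the handling of late events. In Spark, logic of this kind would live inside the state update function passed to flatMapGroupsWithState.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EwmaState:
    """Constant-size state per key: decayed sum, decayed weight, last event time."""
    num: float
    den: float
    last_ts: float  # event time in seconds

    def mean(self) -> float:
        return self.num / self.den


def update(state: Optional[EwmaState], x: float, ts: float,
           half_life_s: float) -> EwmaState:
    """Fold one observation into the state, decaying old data by elapsed time."""
    if state is None:
        return EwmaState(num=x, den=1.0, last_ts=ts)
    # Assumption: late (out-of-order) events are treated as arriving at last_ts.
    dt = max(ts - state.last_ts, 0.0)
    decay = 0.5 ** (dt / half_life_s)  # weight halves every half_life_s seconds
    return EwmaState(
        num=state.num * decay + x,
        den=state.den * decay + 1.0,
        last_ts=max(ts, state.last_ts),
    )


# Two observations one half-life apart: the older one counts half as much.
s = update(None, 10.0, ts=0.0, half_life_s=10.0)
s = update(s, 20.0, ts=10.0, half_life_s=10.0)
print(s.mean())
```

The state is three numbers per key regardless of how long the effective window is, which is what makes this attractive compared with tens of thousands of open windows. The trade-off is that an EWMA weights recent data more heavily, so it is not equivalent to a flat 90-day average.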
