I want to run aggregations over large windows (90 days) with a small slide duration (5 minutes).
The straightforward solution leads to a giant state, on the order of hundreds of gigabytes, which does not look acceptable.
Are there any best practices for doing this?
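For context, a rough back-of-the-envelope sketch of why the state grows so large: with sliding windows, each event falls into roughly window / slide overlapping windows, and an aggregation keeps one state entry per open window per key. The helper name `windows_per_event` is made up for illustration.

```python
import math
from datetime import timedelta


def windows_per_event(window: timedelta, slide: timedelta) -> int:
    """Number of overlapping sliding windows a single event belongs to."""
    return math.ceil(window / slide)


# 90-day window sliding every 5 minutes.
naive = windows_per_event(timedelta(days=90), timedelta(minutes=5))

# The two smaller window shapes mentioned as an alternative.
coarse = windows_per_event(timedelta(days=90), timedelta(days=1))
fine = windows_per_event(timedelta(days=1), timedelta(minutes=5))

print(naive)          # 25920
print(coarse + fine)  # 378
```

Every grouping key multiplies these numbers, so the naive layout keeps about 25,920 open windows per key, while the two smaller shapes together keep 378.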
I am currently considering the following approaches:
Use flatMapGroupsWithState and implement an EWMA (exponentially weighted moving average) instead of a plain average to reduce the state. Is there a good library for EWMA?
Somehow join data from two streams, e.g. a 90-day window with a 1-day slide and a 1-day window with a 5-minute slide.
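A minimal sketch of how the second approach could combine its two inputs, assuming each window emits a partial aggregate as a (sum, count) pair. The names `merge_partials` and `rolling_average` are hypothetical, and this is plain Python showing only the merge step, not a Spark join. The result is approximate: the trailing edge of the 90 days moves in whole-day steps rather than 5-minute steps.

```python
from typing import Iterable, Optional, Tuple

# A partial aggregate: (sum of values, number of values).
Partial = Tuple[float, int]


def merge_partials(parts: Iterable[Partial]) -> Partial:
    """Add up (sum, count) pairs; the merge is associative and order-independent."""
    total, count = 0.0, 0
    for s, c in parts:
        total += s
        count += c
    return total, count


def rolling_average(daily: Iterable[Partial], intraday: Partial) -> Optional[float]:
    """Average over the completed daily partials plus the current intraday partial."""
    total, count = merge_partials([*daily, intraday])
    return total / count if count else None


# Two completed days plus the current partial day.
print(rolling_average([(10.0, 2), (20.0, 3)], (5.0, 5)))  # 3.5
```

Keeping sums and counts rather than averages is what makes the partials mergeable; averages of averages would be wrong whenever the counts differ.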
No, thank you, we have implemented EWMA using flatMapGroupsWithState.
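For readers looking for the shape of such a solution, here is a minimal sketch of a per-key EWMA update, not the poster's actual code. It assumes a time-decayed variant with a decay constant `tau` in seconds so that irregularly spaced events are handled; `EwmaState` and `update_ewma` are made-up names. In Spark, equivalent logic would run inside the function passed to flatMapGroupsWithState, with the state object stored in GroupState.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class EwmaState:
    value: float    # current smoothed value
    last_ts: float  # event time of the latest update, in seconds


def update_ewma(state: Optional[EwmaState], x: float, ts: float, tau: float) -> EwmaState:
    """Fold one observation into the EWMA; the state stays two numbers per key."""
    if state is None:
        # First observation for this key seeds the average.
        return EwmaState(value=x, last_ts=ts)
    # Clamp so that late or out-of-order events get zero weight (an assumption).
    dt = max(ts - state.last_ts, 0.0)
    # The weight of the new observation grows with the time elapsed since the last one.
    alpha = 1.0 - math.exp(-dt / tau)
    return EwmaState(
        value=state.value + alpha * (x - state.value),
        last_ts=max(ts, state.last_ts),
    )


state = update_ewma(None, 10.0, ts=0.0, tau=100.0)
state = update_ewma(state, 20.0, ts=100.0 * math.log(2), tau=100.0)
print(round(state.value, 6))  # 15.0
```

The state per key is constant in size regardless of how long the effective window is, which is the reason an EWMA avoids the state growth of a true 90-day sliding window. The trade-off is that old data is down-weighted smoothly rather than dropped at a hard 90-day cutoff.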
> did any of the docs that Kaniz has shared help you?
Not really. They are only loosely related to my problem.