
Where does custom state store the data

Soma
Valued Contributor

There are a couple of custom state functions, such as mapGroupsWithState and applyInPandasWithState, that maintain internal state. Is that state kept in the same state store (RocksDB) as the state for aggregation functions?


Accepted Solutions

Anonymous
Not applicable

@somanath Sankaran:

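Yes. Custom state functions such as mapGroupsWithState and applyInPandasWithState keep their per-key state in the same state store that built-in streaming aggregations use. Which backend that is depends on the configured state store provider (spark.sql.streaming.stateStore.providerClass). When the RocksDB provider is enabled, both kinds of state live in RocksDB, an embedded, persistent key-value store that is well suited to large amounts of state. Open-source Apache Spark defaults to an HDFS-backed provider that holds state in executor memory, so check the setting rather than assuming RocksDB.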
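As a rough illustration of how the provider can be selected explicitly, here is a minimal sketch. It assumes open-source Apache Spark 3.2 or later, where the RocksDB provider class has the name shown; Databricks documents its own provider class name, so treat the constant as an assumption to verify for your runtime. The helper name use_rocksdb_state_store is hypothetical, and `spark` is expected to be an existing SparkSession.

```python
# Configuration key that decides which state store backend a streaming query uses.
PROVIDER_KEY = "spark.sql.streaming.stateStore.providerClass"

# Assumption: provider class name from open-source Apache Spark 3.2+.
# A Databricks runtime may expect a different, Databricks-specific class name.
ROCKSDB_PROVIDER = (
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider"
)


def use_rocksdb_state_store(spark):
    """Select the RocksDB provider on a SparkSession (hypothetical helper).

    Call this before the streaming query starts. The setting applies to all
    stateful operators in the query, including aggregations and custom state
    functions such as applyInPandasWithState.
    """
    spark.conf.set(PROVIDER_KEY, ROCKSDB_PROVIDER)
    # Read the value back so the caller can confirm what is in effect.
    return spark.conf.get(PROVIDER_KEY)
```

Calling `use_rocksdb_state_store(spark)` once at the top of the job, before any `writeStream.start()`, should be enough; changing the provider for a query that already has a checkpoint is generally not something to do casually.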

The state store is managed by the runtime and is partitioned across the worker nodes in the cluster: each task reads and updates only the state for the keys in its own partition, so many partitions can be processed in parallel. The state is also fault-tolerant. It is saved to the query's checkpoint location and can be restored after a node failure.

When using custom state functions, keep in mind that the amount of state they retain can significantly affect cluster performance and memory usage. Configure a state timeout (GroupStateTimeout) and remove the state for keys that are no longer needed, so that old, unused state is cleaned up regularly instead of growing until the cluster runs out of memory.
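To make the cleanup advice concrete, here is a minimal sketch of the per-key logic that could sit inside the function passed to applyInPandasWithState. It uses only the GroupState members PySpark exposes (exists, get, update, remove, hasTimedOut, setTimeoutDuration). The helper name update_event_count, the ten-minute timeout, and the FakeGroupState class are all assumptions made for illustration; the fake exists only so the logic can be run without a cluster.

```python
class FakeGroupState:
    """Hypothetical in-memory stand-in for the subset of GroupState used below."""

    def __init__(self):
        self._value = None
        self.hasTimedOut = False
        self.timeout_ms = None

    @property
    def exists(self):
        return self._value is not None

    @property
    def get(self):
        if self._value is None:
            raise ValueError("state is not set")
        return self._value

    def update(self, new_value):
        self._value = new_value

    def remove(self):
        self._value = None

    def setTimeoutDuration(self, duration_ms):
        self.timeout_ms = duration_ms


def update_event_count(state, new_events, timeout_ms=10 * 60 * 1000):
    """Keep a running count per key and drop the state once the key goes idle."""
    if state.hasTimedOut:
        # No new data arrived before the timeout: emit the final count and
        # remove the state so it stops occupying the state store.
        final_count = state.get[0]
        state.remove()
        return final_count

    current = state.get[0] if state.exists else 0
    total = current + new_events
    state.update((total,))  # state values are tuples matching the state schema
    # Re-arm the timeout on every update so only idle keys expire.
    state.setTimeoutDuration(timeout_ms)
    return total
```

In a real query the wrapper function would sum the row counts of the incoming pandas batches, call this helper, and yield a DataFrame with the result, and the query would need to be started with GroupStateTimeout.ProcessingTimeTimeout for setTimeoutDuration to be allowed.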

Anonymous
Not applicable

Hi @somanath Sankaran,

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 
