Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Where do custom state functions store their data?

Soma
Valued Contributor

There are a couple of custom state functions, such as mapGroupsWithState and applyInPandasWithState, that maintain internal state. Is that state kept in the same state store (RocksDB) as the state for the aggregation functions?

1 ACCEPTED SOLUTION

Accepted Solutions

Anonymous
Not applicable

@somanath Sankaran:

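Yes. Custom state functions such as mapGroupsWithState and applyInPandasWithState use the same state store as the built-in stateful aggregations. When the RocksDB state store provider is enabled, that state lives in RocksDB, an embedded, persistent key-value store that handles large amounts of state well. Note that RocksDB is not the default everywhere: open-source Spark defaults to an HDFS-backed provider that keeps state in executor memory, and the provider is selected with spark.sql.streaming.stateStore.providerClass.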
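In case it helps, here is a minimal sketch of selecting the RocksDB provider explicitly, in case it is not already the active one on your cluster. The two class names are the RocksDB provider classes for Databricks Runtime and for open-source Spark 3.2+; the helper names `rocksdb_conf` and `apply_conf` are made up for illustration, and `spark` is assumed to be an existing SparkSession. The setting should be applied before the streaming query starts.

```python
# RocksDB state store provider classes (Databricks Runtime vs. open-source Spark 3.2+).
DATABRICKS_ROCKSDB = "com.databricks.sql.streaming.state.RocksDBStateStoreProvider"
OSS_ROCKSDB = "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider"

# Spark setting that selects the state store provider for stateful operators.
PROVIDER_KEY = "spark.sql.streaming.stateStore.providerClass"


def rocksdb_conf(on_databricks: bool = True) -> dict:
    """Return the setting that selects RocksDB (hypothetical helper)."""
    provider = DATABRICKS_ROCKSDB if on_databricks else OSS_ROCKSDB
    return {PROVIDER_KEY: provider}


def apply_conf(spark, conf: dict) -> None:
    """Apply settings to a SparkSession; call this before starting the query."""
    for key, value in conf.items():
        spark.conf.set(key, value)


# Usage, assuming `spark` is an existing SparkSession:
# apply_conf(spark, rocksdb_conf(on_databricks=True))
```

The same provider then backs every stateful operator in the query, whether it is a built-in aggregation or a custom state function.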

The state store is managed by the runtime and is partitioned across the worker nodes in the cluster. Each task reads and updates only the state for the keys in its own partition, so state is updated in parallel across partitions. The state is also fault-tolerant: changes are saved to the query's checkpoint location, so the state can be recovered after a node failure.

When using custom state functions, keep in mind that the amount of state they maintain can significantly affect cluster performance and memory usage. Configure a state timeout (GroupStateTimeout) and remove the state for keys you no longer need, so that old, unused state is cleaned up regularly and you do not run out of memory.
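As a rough sketch of what timeout-based cleanup can look like with applyInPandasWithState: the function below keeps a running event count per key, resets a processing-time timeout on every update, and removes the state once the key times out. The 30-minute timeout, the `user`/`count` column names, and the helper names `update_count` and `count_events` are assumptions for illustration only. The `state` argument is expected to behave like PySpark's GroupState (`exists`, `get`, `hasTimedOut`, `update`, `remove`, `setTimeoutDuration`).

```python
from typing import Iterable, Optional, Tuple

TIMEOUT_MS = 30 * 60 * 1000  # assumption: evict a key after 30 idle minutes


def update_count(key: Tuple, batch_sizes: Iterable[int], state) -> Optional[Tuple[str, int]]:
    """Per-key logic; `state` is expected to behave like PySpark's GroupState."""
    if state.hasTimedOut:
        # No new data arrived before the timeout: emit the final count
        # and remove the state so it stops occupying the state store.
        (count,) = state.get
        state.remove()
        return (key[0], count)
    count = state.get[0] if state.exists else 0
    count += sum(batch_sizes)
    state.update((count,))
    state.setTimeoutDuration(TIMEOUT_MS)  # restart the idle timer
    return None


def count_events(key, pdf_iter, state):
    """Function to pass to applyInPandasWithState (yields pandas DataFrames)."""
    import pandas as pd  # imported lazily; only needed when running under Spark

    result = update_count(key, (len(pdf) for pdf in pdf_iter), state)
    if result is not None:
        yield pd.DataFrame({"user": [result[0]], "count": [result[1]]})


# Usage, assuming `events` is a streaming DataFrame with a `user` column:
# from pyspark.sql.streaming.state import GroupStateTimeout
# events.groupBy("user").applyInPandasWithState(
#     count_events,
#     "user STRING, count LONG",   # output schema
#     "count LONG",                # state schema
#     "append",
#     GroupStateTimeout.ProcessingTimeTimeout,
# )
```

Without the `state.remove()` call, the entry for every key ever seen would stay in the state store for the lifetime of the query.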


3 REPLIES

Hello,

As I understand from the response above, the state store is distributed across the worker nodes, which means the state is stored on the local storage of each worker node. Correct me if I have misunderstood.

If it is stored on the local storage of a worker node, that storage is ephemeral and is wiped when the node restarts. In that case, how is the state restored after a worker node restarts or fails?

Anonymous
Not applicable

Hi @somanath Sankaran,

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 
