Databricks

Soma · 04-09-2023

There are a couple of custom state functions, like mapGroupsWithState and applyInPandasWithState, which maintain internal state. Is that state kept in the same state store (RocksDB) as the state of the aggregation functions?

Anonymous · 04-10-2023

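@somanath Sankaran:

Yes, custom state functions like mapGroupsWithState and applyInPandasWithState use the same state store as the built-in stateful operators such as streaming aggregations. The state lives in whichever state store provider the query is configured with. When the RocksDB provider is enabled, the custom state goes into RocksDB, an embedded, persistent key-value store that keeps state in native memory and on local disk rather than on the JVM heap. Note that in open-source Spark RocksDB is not the default: the default provider keeps state in executor memory backed by checkpoint files, and RocksDB is selected through spark.sql.streaming.stateStore.providerClass.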
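If you want to be sure a query uses RocksDB rather than rely on whatever the default is, you can set the provider explicitly before the stream starts. Here is a minimal sketch. The helper name use_rocksdb_state_store is hypothetical, and the two provider class names are the ones I know from the open-source Spark and Databricks documentation, so please check them against your runtime version.

```python
# Config key that selects the state store provider for Structured Streaming.
PROVIDER_KEY = "spark.sql.streaming.stateStore.providerClass"

# Provider class names (assumption: verify against your Spark / DBR version).
ROCKSDB_PROVIDER_OSS = (
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider"
)
ROCKSDB_PROVIDER_DATABRICKS = (
    "com.databricks.sql.streaming.state.RocksDBStateStoreProvider"
)


def use_rocksdb_state_store(spark, provider=ROCKSDB_PROVIDER_DATABRICKS):
    """Hypothetical helper: point a SparkSession at the RocksDB provider.

    Call this before starting the streaming query; the setting applies to
    every stateful operator in the query, built-in or custom.
    """
    spark.conf.set(PROVIDER_KEY, provider)
    return spark.conf.get(PROVIDER_KEY)
```

In a notebook this would be called as use_rocksdb_state_store(spark) before writeStream ... start(). The same provider then backs aggregations, mapGroupsWithState and applyInPandasWithState alike.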

The state store is managed by the Databricks runtime and is partitioned by grouping key across the executors on the worker nodes. Each state partition is read and updated by a single task per micro-batch, so partitions are processed in parallel without tasks sharing state. The state is also fault-tolerant: it is checkpointed to the query's checkpoint location and can be restored after a node failure.

When using custom state functions, keep in mind that the amount of state the function maintains can have a significant impact on cluster performance and memory usage. Configure a state timeout and remove state for keys that are no longer needed, so that old, unused state is cleaned up regularly and the cluster does not run out of memory.
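As a rough illustration of the timeout and cleanup point, here is a sketch of the per-key logic for a running count with applyInPandasWithState. The names update_count, count_events and TIMEOUT_MS and the "user" / "count" columns are made up for the example. The state object is assumed to behave like PySpark's GroupState (exists, get, hasTimedOut, update, remove, setTimeoutDuration).

```python
from typing import Any, Iterable, Optional

# Hypothetical value: drop a key's state after 10 minutes without new data.
TIMEOUT_MS = 10 * 60 * 1000


def update_count(state: Any, batch_sizes: Iterable[int]) -> Optional[int]:
    """Update the running count for one key; return None if the key expired."""
    if state.hasTimedOut:
        # No data arrived within the timeout: evict this key's state.
        state.remove()
        return None
    (count,) = state.get if state.exists else (0,)
    count += sum(batch_sizes)
    state.update((count,))
    # Re-arm the timeout every time the key receives data.
    state.setTimeoutDuration(TIMEOUT_MS)
    return count


def count_events(key, pdf_iter, state):
    """Function to pass to applyInPandasWithState (needs pandas installed)."""
    import pandas as pd

    total = update_count(state, (len(pdf) for pdf in pdf_iter))
    if total is not None:
        yield pd.DataFrame({"user": [key[0]], "count": [total]})


# Wiring, assuming a streaming DataFrame `events` with a "user" column:
# events.groupBy("user").applyInPandasWithState(
#     count_events,
#     outputStructType="user STRING, count LONG",
#     stateStructType="count LONG",
#     outputMode="update",
#     timeoutConf="ProcessingTimeTimeout",
# )
```

Without the state.remove() call on timeout, every key ever seen would stay in the state store for the lifetime of the query.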

Anonymous · 04-12-2023

Hi @somanath Sankaran

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!
