Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

State store configuration with applyInPandasWithState for optimal performance

PushkarDeole
New Contributor II

Hello,

We are using a stateful pipeline for data processing and analytics. We manage state with the applyInPandasWithState function; however, the state needs to persist across node restarts and similar events.

At this point, we are not sure how to make the state persistent with applyInPandasWithState. Some articles mention using the RocksDB state store for persistence.

A couple of questions:

1. What configuration is required to enable the RocksDB state store with applyInPandasWithState?

2. Which RocksDB state store parameters can be tuned for optimal performance?

Any guidance around these would be appreciated. 

 

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @PushkarDeole, to use RocksDB as the state store with `applyInPandasWithState` in Databricks, configure your Spark session with the following setting:

spark.conf.set("spark.sql.streaming.stateStore.providerClass", "com.databricks.sql.streaming.state.RocksDBStateStoreProvider")

With this configuration, your streaming queries manage their state in RocksDB. To optimize performance, specify a local directory (`rocksdb.localdir`) for RocksDB's working state, ideally on fast ephemeral storage such as local SSDs. Avoid frequent reads and writes against remote storage such as S3, which degrade performance. Also consider enabling asynchronous checkpointing and Databricks' state rebalancing, which distributes state evenly across nodes.
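As a rough sketch of how these pieces could fit together: the provider class below is the one quoted in this reply. The asynchronous checkpointing key is an assumption taken from the Databricks async checkpointing documentation, so please verify it against your runtime version. The helper names, the Delta sink, and the paths are hypothetical and only for illustration. Note that persistence across restarts comes from the query's checkpoint location on durable storage, which Structured Streaming uses to recover state when the query restarts with the same checkpoint.

```python
# Provider class quoted in the reply above.
ROCKSDB_PROVIDER = "com.databricks.sql.streaming.state.RocksDBStateStoreProvider"

# Assumption: key name as given in the Databricks async checkpointing docs.
ASYNC_CHECKPOINT_KEY = (
    "spark.databricks.streaming.statefulOperator.asyncCheckpoint.enabled"
)


def configure_state_store(spark, enable_async_checkpoint=False):
    """Apply state store settings before the streaming query is started.

    `spark` is an existing SparkSession (for example, the one a Databricks
    notebook provides). Async checkpointing is off by default here because
    its documented limitations may matter on autoscaling clusters.
    """
    spark.conf.set("spark.sql.streaming.stateStore.providerClass", ROCKSDB_PROVIDER)
    if enable_async_checkpoint:
        spark.conf.set(ASYNC_CHECKPOINT_KEY, "true")
    return spark


def start_stateful_query(result_df, checkpoint_path, sink_path):
    """Start the query with a durable checkpoint location.

    `result_df` is assumed to be the streaming DataFrame returned by
    groupBy(...).applyInPandasWithState(...). The output mode here should
    match the one passed to applyInPandasWithState.
    """
    return (
        result_df.writeStream
        .format("delta")  # hypothetical sink choice
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)  # durable cloud storage path
        .start(sink_path)
    )
```

In a notebook this might be used as `configure_state_store(spark)` followed by `start_stateful_query(result_df, checkpoint_path, sink_path)`, with both paths pointing at storage that outlives the cluster.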

If you have more questions, feel free to ask! 😊

@Kaniz_Fatma thanks for the response. As I understand it, there are 3 options to explore to get optimal performance out of RocksDB-based state management:

1. Specify a local directory 'rocksdb.localdir'

--> Could you point me to the configuration through which this can be specified?

2. Implement asynchronous checkpoints

--> I looked more into the details of asynchronous checkpoints through this article https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/async-checkpointing

As mentioned in the limitations, cluster resizing might not work well with asynchronous checkpointing. Since we use the autoscaling feature on our Databricks cluster, which resizes the cluster frequently, does that mean we won't be able to use asynchronous checkpointing?

3. Databricks' state rebalancing

--> I will explore this further.
