<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: State store configuration with applyInPandasWithState for optimal performance in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/state-store-configuration-with-applyinpandaswithstate-for/m-p/79093#M35671</link>
    <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt; thanks for the response. As I understand it, there are three options to explore to get optimal performance out of RocksDB-based state management:&lt;/P&gt;&lt;P&gt;1. Specify a local directory, 'rocksdb.localdir'&lt;/P&gt;&lt;P&gt;--&amp;gt; Could you advise how (through which configuration) this can be specified?&lt;/P&gt;&lt;P&gt;2. Implement asynchronous checkpoints&lt;/P&gt;&lt;P&gt;--&amp;gt; I looked further into asynchronous checkpointing through this article: &lt;A href="https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/async-checkpointing" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/async-checkpointing&lt;/A&gt;&lt;/P&gt;&lt;P&gt;As mentioned in the limitations, cluster resizing might not work well with asynchronous checkpointing. Since we use the autoscaling feature for our Databricks cluster, which will frequently resize the cluster, does that mean we won't be able to use asynchronous checkpointing?&lt;/P&gt;&lt;P&gt;3. Databricks' state rebalancing&lt;/P&gt;&lt;P&gt;--&amp;gt; I will explore this further.&lt;/P&gt;</description>
    <pubDate>Wed, 17 Jul 2024 09:58:50 GMT</pubDate>
    <dc:creator>PushkarDeole</dc:creator>
    <dc:date>2024-07-17T09:58:50Z</dc:date>
    <item>
      <title>State store configuration with applyInPandasWithState for optimal performance</title>
      <link>https://community.databricks.com/t5/data-engineering/state-store-configuration-with-applyinpandaswithstate-for/m-p/78906#M35634</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;We are using a stateful pipeline for data processing and analytics. For state management, we are using the applyInPandasWithState function; however, the state needs to persist across node restarts and similar events.&lt;/P&gt;&lt;P&gt;At this point, we are not sure how the state can be made persistent with applyInPandasWithState. Some articles mention using the RocksDB state store for persistence.&lt;/P&gt;&lt;P&gt;A couple of questions:&lt;/P&gt;&lt;P&gt;1. What configuration is required to enable the RocksDB state store with applyInPandasWithState?&lt;/P&gt;&lt;P&gt;2. Which RocksDB state store parameters can be tuned for optimal performance?&lt;/P&gt;&lt;P&gt;Any guidance on these would be appreciated.&lt;/P&gt;</description>
      <pubDate>Tue, 16 Jul 2024 05:20:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/state-store-configuration-with-applyinpandaswithstate-for/m-p/78906#M35634</guid>
      <dc:creator>PushkarDeole</dc:creator>
      <dc:date>2024-07-16T05:20:43Z</dc:date>
    </item>
    <item>
      <title>Re: State store configuration with applyInPandasWithState for optimal performance</title>
      <link>https://community.databricks.com/t5/data-engineering/state-store-configuration-with-applyinpandaswithstate-for/m-p/79093#M35671</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt; thanks for the response. As I understand it, there are three options to explore to get optimal performance out of RocksDB-based state management:&lt;/P&gt;&lt;P&gt;1. Specify a local directory, 'rocksdb.localdir'&lt;/P&gt;&lt;P&gt;--&amp;gt; Could you advise how (through which configuration) this can be specified?&lt;/P&gt;&lt;P&gt;2. Implement asynchronous checkpoints&lt;/P&gt;&lt;P&gt;--&amp;gt; I looked further into asynchronous checkpointing through this article: &lt;A href="https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/async-checkpointing" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/async-checkpointing&lt;/A&gt;&lt;/P&gt;&lt;P&gt;As mentioned in the limitations, cluster resizing might not work well with asynchronous checkpointing. Since we use the autoscaling feature for our Databricks cluster, which will frequently resize the cluster, does that mean we won't be able to use asynchronous checkpointing?&lt;/P&gt;&lt;P&gt;3. Databricks' state rebalancing&lt;/P&gt;&lt;P&gt;--&amp;gt; I will explore this further.&lt;/P&gt;</description>
      <pubDate>Wed, 17 Jul 2024 09:58:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/state-store-configuration-with-applyinpandaswithstate-for/m-p/79093#M35671</guid>
      <dc:creator>PushkarDeole</dc:creator>
      <dc:date>2024-07-17T09:58:50Z</dc:date>
    </item>
  </channel>
</rss>

