<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Maintaining Custom State in Structured Streaming in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/maintaining-custom-state-in-structured-streaming/m-p/6870#M2872</link>
    <description>&lt;P&gt;I am consuming an IoT stream with thousands of different signals using Structured Streaming. While processing the stream, I need to know the previous timestamp and value for each signal in the micro-batch. The signal stream is eventually written to a Delta table. Every signal is expected to be sent at least once every hour.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Is it possible to use the internal State Store as a cache to hold this custom state of the previous timestamp and value for each signal?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If not, what would be the canonical approach to maintaining such state?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;These are the approaches I can think of.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Approach 1:&lt;/P&gt;&lt;P&gt;Join the stream with the target table itself to get the previous signal timestamp and value.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Approach 2:&lt;/P&gt;&lt;P&gt;Maintain a separate ‘state table’ containing the previous timestamp and value for each signal. The ‘state table’ would then be joined with the stream to get the previous signal timestamp and value.&lt;/P&gt;&lt;P&gt;On receiving new signal values, the ‘state table’ would be updated using MERGE INTO.&lt;/P&gt;</description>
    <pubDate>Wed, 29 Mar 2023 13:35:43 GMT</pubDate>
    <dc:creator>Starki</dc:creator>
    <dc:date>2023-03-29T13:35:43Z</dc:date>
    <item>
      <title>Maintaining Custom State in Structured Streaming</title>
      <link>https://community.databricks.com/t5/data-engineering/maintaining-custom-state-in-structured-streaming/m-p/6870#M2872</link>
      <description>&lt;P&gt;I am consuming an IoT stream with thousands of different signals using Structured Streaming. While processing the stream, I need to know the previous timestamp and value for each signal in the micro-batch. The signal stream is eventually written to a Delta table. Every signal is expected to be sent at least once every hour.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Is it possible to use the internal State Store as a cache to hold this custom state of the previous timestamp and value for each signal?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If not, what would be the canonical approach to maintaining such state?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;These are the approaches I can think of.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Approach 1:&lt;/P&gt;&lt;P&gt;Join the stream with the target table itself to get the previous signal timestamp and value.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Approach 2:&lt;/P&gt;&lt;P&gt;Maintain a separate ‘state table’ containing the previous timestamp and value for each signal. The ‘state table’ would then be joined with the stream to get the previous signal timestamp and value.&lt;/P&gt;&lt;P&gt;On receiving new signal values, the ‘state table’ would be updated using MERGE INTO.&lt;/P&gt;</description>
      <pubDate>Wed, 29 Mar 2023 13:35:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/maintaining-custom-state-in-structured-streaming/m-p/6870#M2872</guid>
      <dc:creator>Starki</dc:creator>
      <dc:date>2023-03-29T13:35:43Z</dc:date>
    </item>
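    <!-- Editor's note: a minimal, Spark-free sketch, not part of the thread. The per-signal
    state the question describes (the previous timestamp and value for each signal,
    consulted and then refreshed every micro-batch) amounts to a keyed cache. Inside
    Structured Streaming that role is played by the state store, e.g. via
    flatMapGroupsWithState (Scala) or applyInPandasWithState (PySpark); the class
    and record layout below are illustrative assumptions, not from the thread.

```python
class SignalStateCache:
    """Keeps the last (timestamp, value) seen for each signal id."""

    def __init__(self):
        # signal_id -> (timestamp, value); the state store plays this
        # role per grouping key inside Structured Streaming.
        self._state = {}

    def enrich_batch(self, batch):
        """For each (signal_id, ts, value) record in a micro-batch, attach
        the previous (ts, value) for that signal, then store the new one."""
        out = []
        for signal_id, ts, value in batch:
            prev_ts, prev_value = self._state.get(signal_id, (None, None))
            out.append((signal_id, ts, value, prev_ts, prev_value))
            self._state[signal_id] = (ts, value)
        return out


cache = SignalStateCache()
# First micro-batch: no prior state, so prev fields are None.
batch1 = cache.enrich_batch([("temp_1", 10, 21.5), ("temp_2", 10, 19.0)])
# Second micro-batch: temp_1 now carries its previous (10, 21.5).
batch2 = cache.enrich_batch([("temp_1", 20, 22.0)])
```

    The same shape carries over to applyInPandasWithState, where the dict lookup
    becomes a GroupState read/update per signal key.
    -->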
    <item>
      <title>Re: Maintaining Custom State in Structured Streaming</title>
      <link>https://community.databricks.com/t5/data-engineering/maintaining-custom-state-in-structured-streaming/m-p/6871#M2873</link>
      <description>&lt;P&gt;@Suteja Kanuri&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried the above on a streaming DataFrame, but I am facing the error below:&lt;/P&gt;&lt;P&gt;AttributeError: 'DataFrame' object has no attribute 'groupByKey'&lt;/P&gt;&lt;P&gt;Can you please let me know which DBR runtime you used?&lt;/P&gt;</description>
      <pubDate>Sat, 08 Apr 2023 17:34:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/maintaining-custom-state-in-structured-streaming/m-p/6871#M2873</guid>
      <dc:creator>Soma</dc:creator>
      <dc:date>2023-04-08T17:34:31Z</dc:date>
    </item>
  </channel>
</rss>

