<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Chaining stateful Operator in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/chaining-stateful-operator/m-p/82872#M36756</link>
    <description>&lt;P&gt;I would like to do a groupBy followed by a join in Structured Streaming. I would read from two Delta tables in snapshot mode, i.e. the latest snapshot.&lt;/P&gt;&lt;P&gt;My question is specifically about chaining stateful operators.&lt;/P&gt;&lt;P&gt;The groupBy is in update mode.&lt;/P&gt;&lt;P&gt;Chaining a groupBy and a join must be in append mode overall.&lt;/P&gt;&lt;P&gt;Does that mean the groupBy would also behave as if it were in append mode, or can the groupBy be in update mode and the join in append mode?&lt;/P&gt;</description>
    <pubDate>Tue, 13 Aug 2024 12:53:47 GMT</pubDate>
    <dc:creator>Maatari</dc:creator>
    <dc:date>2024-08-13T12:53:47Z</dc:date>
    <item>
      <title>Chaining stateful Operator</title>
      <link>https://community.databricks.com/t5/data-engineering/chaining-stateful-operator/m-p/82872#M36756</link>
      <description>&lt;P&gt;I would like to do a groupBy followed by a join in Structured Streaming. I would read from two Delta tables in snapshot mode, i.e. the latest snapshot.&lt;/P&gt;&lt;P&gt;My question is specifically about chaining stateful operators.&lt;/P&gt;&lt;P&gt;The groupBy is in update mode.&lt;/P&gt;&lt;P&gt;Chaining a groupBy and a join must be in append mode overall.&lt;/P&gt;&lt;P&gt;Does that mean the groupBy would also behave as if it were in append mode, or can the groupBy be in update mode and the join in append mode?&lt;/P&gt;</description>
      <pubDate>Tue, 13 Aug 2024 12:53:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/chaining-stateful-operator/m-p/82872#M36756</guid>
      <dc:creator>Maatari</dc:creator>
      <dc:date>2024-08-13T12:53:47Z</dc:date>
    </item>
    <item>
      <title>Re: Chaining stateful Operator</title>
      <link>https://community.databricks.com/t5/data-engineering/chaining-stateful-operator/m-p/139314#M51150</link>
      <description>&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;When chaining stateful operators like groupBy (aggregation) and join in Spark Structured Streaming, there are specific rules about the output mode required for the overall query and the behavior of each operator.&lt;/P&gt;
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Output Mode Requirements&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;The&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;groupBy&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;operator (stateful aggregation) supports&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;update&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;and&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;complete&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;output modes when used alone because it may update existing aggregated values as new data arrives.​&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;The&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;join&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;between two streaming DataFrames must use&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;append&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;output mode overall, meaning only newly joined rows are emitted downstream.​&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
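&lt;P&gt;The difference between the two modes can be illustrated with a toy model in plain Python. This is not Spark code, only a minimal sketch of how a windowed count emits rows in update mode versus append mode. The names run, WINDOW and DELAY are hypothetical, and the watermark handling is simplified: it advances at the end of each micro-batch and is applied in the next one.&lt;/P&gt;
&lt;PRE&gt;
```python
from collections import Counter
from operator import le

WINDOW = 10  # tumbling window length in seconds (assumed value)
DELAY = 5    # watermark delay in seconds (assumed value)


def run(batches, mode):
    """Toy model of a windowed count; returns the rows emitted per micro-batch."""
    state = Counter()   # window start mapped to its running count
    max_event_time = 0
    watermark = 0
    emitted = []
    for batch in batches:
        touched = set()
        for event_time in batch:
            start = event_time - event_time % WINDOW
            if le(start + WINDOW, watermark):
                continue  # window already closed by the watermark: late event is dropped
            state[start] += 1
            touched.add(start)
            max_event_time = max(max_event_time, event_time)
        if mode == "update":
            # Update mode: re-emit every window whose count changed in this batch.
            rows = [(start, state[start]) for start in sorted(touched)]
        else:
            # Append mode: emit a window once, after the watermark has passed
            # its end, then remove it from state.
            done = [s for s in sorted(state) if le(s + WINDOW, watermark)]
            rows = [(s, state.pop(s)) for s in done]
        emitted.append(rows)
        # Advance the watermark for the next micro-batch; it never moves backwards.
        watermark = max(watermark, max_event_time - DELAY)
    return emitted
```
&lt;/PRE&gt;
&lt;P&gt;Calling run([[1, 3], [4, 12], [27], [41]], "update") re-emits window 0 as its count grows from 2 to 3, while the same input in "append" mode emits nothing until the last batch, when windows 0 and 10 are released with their final counts.&lt;/P&gt;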
&lt;H2 class="mb-2 mt-4 font-display font-semimedium text-base first:mt-0"&gt;Behavior When Chaining Operators&lt;/H2&gt;
&lt;UL class="marker:text-quiet list-disc"&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;When you chain a groupBy (update mode) followed by a streaming join, the overall query is required to run in&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;append&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;mode because joins in Structured Streaming only support append output.​&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;amp;&amp;gt;p]:pt-0 [&amp;amp;&amp;gt;p]:mb-2 [&amp;amp;&amp;gt;p]:my-0"&gt;
&lt;P class="my-2 [&amp;amp;+p]:mt-4 [&amp;amp;_strong:has(+br)]:inline-block [&amp;amp;_strong:has(+br)]:pb-2"&gt;&lt;STRONG&gt;This does not mean that the groupBy operator itself shifts to append mode&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;internally. The aggregation still behaves like an update aggregation: it maintains state and recalculates aggregates as new data arrives. However, Spark will output only the newly joined records, not updated aggregations, downstream—effectively discarding any updated rows not resulting in a new join.​&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Mon, 17 Nov 2025 11:46:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/chaining-stateful-operator/m-p/139314#M51150</guid>
      <dc:creator>mark_ott</dc:creator>
      <dc:date>2025-11-17T11:46:13Z</dc:date>
    </item>
  </channel>
</rss>

