<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Streaming read and writing with aggregation in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/streaming-read-and-writing-with-aggregation/m-p/156736#M54474</link>
    <description>&lt;P class="p8i6j01 paragraph"&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/228755"&gt;@Guillermo-HR&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="p8i6j01 paragraph"&gt;Yes — &lt;STRONG&gt;batch is usually the right fix here&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P class="p8i6j01 paragraph"&gt;What’s happening is that your query is using &lt;STRONG&gt;event-time window aggregation in Structured Streaming with &lt;CODE class="p8i6j0f"&gt;append&lt;/CODE&gt; output mode&lt;/STRONG&gt;. In that mode, Spark only emits a window &lt;STRONG&gt;after it is sure the window is closed&lt;/STRONG&gt; according to the watermark. With monthly files and &lt;CODE class="p8i6j0f"&gt;availableNow=True&lt;/CODE&gt;, the &lt;STRONG&gt;final day’s 1-day window often never gets finalized during that run&lt;/STRONG&gt;, so it stays buffered and appears “missing”.&lt;/P&gt;
&lt;P class="p8i6j01 paragraph"&gt;A few key points:&lt;/P&gt;
&lt;UL class="p8i6j07 p8i6j02"&gt;
&lt;LI class="p8i6j0a"&gt;&lt;CODE class="p8i6j0f"&gt;withWatermark("datetime", "0 second")&lt;/CODE&gt; does &lt;STRONG&gt;not&lt;/STRONG&gt; mean “emit immediately”; it still depends on seeing &lt;STRONG&gt;later event-time data&lt;/STRONG&gt; to advance the watermark past the end of the last window.&lt;/LI&gt;
&lt;LI class="p8i6j0a"&gt;If your last record is on &lt;CODE class="p8i6j0f"&gt;2026-04-30 23:00&lt;/CODE&gt;, there may be &lt;STRONG&gt;no later event&lt;/STRONG&gt; to push the watermark beyond the &lt;CODE class="p8i6j0f"&gt;2026-04-30&lt;/CODE&gt; daily window.&lt;/LI&gt;
&lt;LI class="p8i6j0a"&gt;&lt;CODE class="p8i6j0f"&gt;append&lt;/CODE&gt; + windowed aggregation is therefore a bad fit for &lt;STRONG&gt;bounded monthly backfills/files&lt;/STRONG&gt; like this.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="_9k2iva0 p8i6j0c _1ibi0s314 heading3 _9k2iva1"&gt;Recommendation:&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;Use batch instead of streaming&lt;/LI&gt;
&lt;LI&gt;If you want to use streaming, Use &lt;CODE data-end="2192" data-start="2168"&gt;outputMode("complete")&lt;/CODE&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
    <pubDate>Wed, 13 May 2026 04:36:41 GMT</pubDate>
    <dc:creator>Saritha_S</dc:creator>
    <dc:date>2026-05-13T04:36:41Z</dc:date>
    <item>
      <title>Streaming read and writing with aggregation</title>
      <link>https://community.databricks.com/t5/data-engineering/streaming-read-and-writing-with-aggregation/m-p/155848#M54323</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I have the following problem: on a medallion architecture on a bronze volume I get files every month containing the data for each sensor reading during the period 1 of month 00:00 to last day 23:00. I have a manual job that calls the python files to load and transform from landing -&amp;gt; bronze-&amp;gt; silver. Now I want to add the aggregated data to a gold table. I have the following code:&lt;/P&gt;&lt;P&gt;df_silver_swell_metrics = (&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; spark.readStream&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .format("delta")&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .table(f"cor_project.silver.swell_metrics")&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .withWatermark("datetime", "0 second")&lt;/P&gt;&lt;P&gt;)&lt;/P&gt;&lt;P&gt;df_silver_swell_metrics_transformed = (&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; df_silver_swell_metrics&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; .groupBy(&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; F.window("datetime", "1 day").alias("window"),&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "coast_name"&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; ).agg(&lt;/P&gt;&lt;P&gt;...&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; )&lt;/P&gt;&lt;P&gt;)&lt;/P&gt;&lt;P&gt;df_gold_wave_daily_summary = (&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; df_silver_swell_metrics_transformed&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; .select(&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ...&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; )&lt;/P&gt;&lt;P&gt;)&lt;/P&gt;&lt;P&gt;(&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; df_gold_wave_daily_summary.writeStream&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .format("delta")&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .outputMode("append")&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .trigger(availableNow=True)&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .option("checkpointLocation", f"/Volumes/cor_{ambiente}/gold/data/checkpoints/wave_daily_summary")&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; .toTable(f"cor_project.gold.wave_daily_summary")&lt;/P&gt;&lt;P&gt;)&lt;/P&gt;&lt;P&gt;This works it doesn't repeat data that has already been processed&amp;nbsp; but it always misses the last day of the month. I have asked in other groups and they told me to use batch processing. Any thoughts on posible solutions?&lt;/P&gt;&lt;P&gt;Thanks for any comment&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 30 Apr 2026 05:41:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/streaming-read-and-writing-with-aggregation/m-p/155848#M54323</guid>
      <dc:creator>Guillermo-HR</dc:creator>
      <dc:date>2026-04-30T05:41:03Z</dc:date>
    </item>
    <item>
      <title>Re: Streaming read and writing with aggregation</title>
      <link>https://community.databricks.com/t5/data-engineering/streaming-read-and-writing-with-aggregation/m-p/156736#M54474</link>
      <description>&lt;P class="p8i6j01 paragraph"&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/228755"&gt;@Guillermo-HR&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P class="p8i6j01 paragraph"&gt;Yes — &lt;STRONG&gt;batch is usually the right fix here&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P class="p8i6j01 paragraph"&gt;What’s happening is that your query is using &lt;STRONG&gt;event-time window aggregation in Structured Streaming with &lt;CODE class="p8i6j0f"&gt;append&lt;/CODE&gt; output mode&lt;/STRONG&gt;. In that mode, Spark only emits a window &lt;STRONG&gt;after it is sure the window is closed&lt;/STRONG&gt; according to the watermark. With monthly files and &lt;CODE class="p8i6j0f"&gt;availableNow=True&lt;/CODE&gt;, the &lt;STRONG&gt;final day’s 1-day window often never gets finalized during that run&lt;/STRONG&gt;, so it stays buffered and appears “missing”.&lt;/P&gt;
&lt;P class="p8i6j01 paragraph"&gt;A few key points:&lt;/P&gt;
&lt;UL class="p8i6j07 p8i6j02"&gt;
&lt;LI class="p8i6j0a"&gt;&lt;CODE class="p8i6j0f"&gt;withWatermark("datetime", "0 second")&lt;/CODE&gt; does &lt;STRONG&gt;not&lt;/STRONG&gt; mean “emit immediately”; it still depends on seeing &lt;STRONG&gt;later event-time data&lt;/STRONG&gt; to advance the watermark past the end of the last window.&lt;/LI&gt;
&lt;LI class="p8i6j0a"&gt;If your last record is on &lt;CODE class="p8i6j0f"&gt;2026-04-30 23:00&lt;/CODE&gt;, there may be &lt;STRONG&gt;no later event&lt;/STRONG&gt; to push the watermark beyond the &lt;CODE class="p8i6j0f"&gt;2026-04-30&lt;/CODE&gt; daily window.&lt;/LI&gt;
&lt;LI class="p8i6j0a"&gt;&lt;CODE class="p8i6j0f"&gt;append&lt;/CODE&gt; + windowed aggregation is therefore a bad fit for &lt;STRONG&gt;bounded monthly backfills/files&lt;/STRONG&gt; like this.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="_9k2iva0 p8i6j0c _1ibi0s314 heading3 _9k2iva1"&gt;Recommendation:&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;Use batch instead of streaming&lt;/LI&gt;
&lt;LI&gt;If you want to use streaming, Use &lt;CODE data-end="2192" data-start="2168"&gt;outputMode("complete")&lt;/CODE&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Wed, 13 May 2026 04:36:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/streaming-read-and-writing-with-aggregation/m-p/156736#M54474</guid>
      <dc:creator>Saritha_S</dc:creator>
      <dc:date>2026-05-13T04:36:41Z</dc:date>
    </item>
  </channel>
</rss>

