Streaming read and writing with aggregation

Guillermo-HR — Thu, 30 Apr 2026 05:41:03 GMT

Hi,

I have the following problem: on a medallion architecture on a bronze volume I get files every month containing the data for each sensor reading during the period 1 of month 00:00 to last day 23:00. I have a manual job that calls the python files to load and transform from landing -> bronze-> silver. Now I want to add the aggregated data to a gold table. I have the following code:

df_silver_swell_metrics = (

spark.readStream

.format("delta")

.table(f"cor_project.silver.swell_metrics")

.withWatermark("datetime", "0 second")

)

df_silver_swell_metrics_transformed = (

df_silver_swell_metrics

.groupBy(

F.window("datetime", "1 day").alias("window"),

"coast_name"

).agg(

...

)

df_gold_wave_daily_summary = (

df_silver_swell_metrics_transformed

.select(

...

)

(

df_gold_wave_daily_summary.writeStream

.format("delta")

.outputMode("append")

.trigger(availableNow=True)

.option("checkpointLocation", f"/Volumes/cor_{ambiente}/gold/data/checkpoints/wave_daily_summary")

.toTable(f"cor_project.gold.wave_daily_summary")

)

This works it doesn't repeat data that has already been processed but it always misses the last day of the month. I have asked in other groups and they told me to use batch processing. Any thoughts on posible solutions?

Thanks for any comment

Re: Streaming read and writing with aggregation

Saritha_S — Wed, 13 May 2026 04:36:41 GMT

Hi @Guillermo-HR

Yes — batch is usually the right fix here.

What’s happening is that your query is using event-time window aggregation in Structured Streaming with append output mode. In that mode, Spark only emits a window after it is sure the window is closed according to the watermark. With monthly files and availableNow=True, the final day’s 1-day window often never gets finalized during that run, so it stays buffered and appears “missing”.

A few key points:

withWatermark("datetime", "0 second") does not mean “emit immediately”; it still depends on seeing later event-time data to advance the watermark past the end of the last window.
If your last record is on 2026-04-30 23:00, there may be no later event to push the watermark beyond the 2026-04-30 daily window.
append + windowed aggregation is therefore a bad fit for bounded monthly backfills/files like this.

Recommendation:

Use batch instead of streaming
If you want to use streaming, Use outputMode("complete")

topic Re: Streaming read and writing with aggregation in Data Engineering

Streaming read and writing with aggregation

Re: Streaming read and writing with aggregation

Recommendation: