Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Streaming read and writing with aggregation

Guillermo-HR
New Contributor

Hi,

I have the following problem. In a medallion architecture, a bronze volume receives files every month containing the sensor readings for the period from the 1st of the month at 00:00 to the last day at 23:00. I have a manual job that calls the Python files to load and transform the data from landing -> bronze -> silver. Now I want to add the aggregated data to a gold table. I have the following code:

df_silver_swell_metrics = (
    spark.readStream
        .format("delta")
        .table(f"cor_project.silver.swell_metrics")
        .withWatermark("datetime", "0 second")
)

df_silver_swell_metrics_transformed = (
    df_silver_swell_metrics
    .groupBy(
        F.window("datetime", "1 day").alias("window"),
        "coast_name"
    ).agg(
        ...
    )
)

df_gold_wave_daily_summary = (
    df_silver_swell_metrics_transformed
    .select(
        ...
    )
)

(
    df_gold_wave_daily_summary.writeStream
        .format("delta")
        .outputMode("append")
        .trigger(availableNow=True)
        .option("checkpointLocation", f"/Volumes/cor_{ambiente}/gold/data/checkpoints/wave_daily_summary")
        .toTable(f"cor_project.gold.wave_daily_summary")
)

This works, and it doesn't repeat data that has already been processed, but it always misses the last day of the month. I have asked in other groups and they told me to use batch processing. Any thoughts on possible solutions?

Thanks for any comment 

1 REPLY

Saritha_S
Databricks Employee

Hi @Guillermo-HR 

Yes — batch is usually the right fix here.

What’s happening is that your query is using event-time window aggregation in Structured Streaming with append output mode. In that mode, Spark only emits a window after it is sure the window is closed according to the watermark. With monthly files and availableNow=True, the final day’s 1-day window often never gets finalized during that run, so it stays buffered and appears “missing”.

A few key points:

  • withWatermark("datetime", "0 second") does not mean “emit immediately”; it still depends on seeing later event-time data to advance the watermark past the end of the last window.
  • If your last record is on 2026-04-30 23:00, there may be no later event to push the watermark beyond the 2026-04-30 daily window.
  • append + windowed aggregation is therefore a bad fit for bounded monthly backfills/files like this.
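To make the points above concrete, here is a small pure-Python sketch of the rule involved. It is a simplified model, not Spark's actual implementation: it assumes the watermark is the maximum event time seen minus the delay, and that a 1-day window can be emitted in append mode only once its end is at or before the watermark. The function name `closed_windows` and the sample data are made up for illustration.

```python
from datetime import datetime, timedelta


def closed_windows(event_times, delay=timedelta(0)):
    """Return the start of every 1-day window whose end is <= the watermark.

    Simplified model (an assumption, not Spark internals):
    watermark = max(event time) - delay.
    """
    watermark = max(event_times) - delay
    # Truncate each event time to midnight to get its daily window start.
    day_starts = sorted(
        {t.replace(hour=0, minute=0, second=0, microsecond=0) for t in event_times}
    )
    return [s for s in day_starts if s + timedelta(days=1) <= watermark]


# Hourly readings for the last two days of April, ending at 2026-04-30 23:00.
april_tail = [datetime(2026, 4, 29) + timedelta(hours=h) for h in range(48)]

# Only the 2026-04-29 window is closed; 2026-04-30 is still open.
print(closed_windows(april_tail))

# Once a later event arrives (here 2026-05-01 00:00), 2026-04-30 closes too.
print(closed_windows(april_tail + [datetime(2026, 5, 1)]))
```

In this model the last day of the month becomes emittable only after data from the following month shows up, which matches the behaviour described in the question.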

Recommendation:

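  • Use batch instead of streaming.
  • If you want to stay with streaming, use outputMode("complete"), which rewrites the full aggregated result on every run instead of waiting for windows to close.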
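As a rough illustration of the batch option, the sketch below recomputes one month of daily aggregates and overwrites only that month in the gold table. Several things here are assumptions and not taken from the original post: the helper names `month_bounds` and `rebuild_month`, the gold column `day`, and the placeholder aggregate `reading_count` (the real aggregations are elided in the question). It groups by `F.to_date("datetime")` instead of `F.window`, and it assumes the gold table already exists as a Delta table so that the `replaceWhere` option applies.

```python
from datetime import date


def month_bounds(year, month):
    """Return (first day of the month, first day of the next month)."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end


def rebuild_month(spark, year, month):
    """Recompute daily aggregates for one month and overwrite only that month."""
    # Imported here so month_bounds can be used without a Spark installation.
    from pyspark.sql import functions as F

    start, end = month_bounds(year, month)

    silver = spark.read.table("cor_project.silver.swell_metrics").where(
        (F.col("datetime") >= F.lit(start.isoformat()).cast("timestamp"))
        & (F.col("datetime") < F.lit(end.isoformat()).cast("timestamp"))
    )

    daily = (
        silver
        .groupBy(F.to_date("datetime").alias("day"), "coast_name")
        # Placeholder aggregate: replace with the real metrics.
        .agg(F.count("*").alias("reading_count"))
    )

    (
        daily.write
        .format("delta")
        .mode("overwrite")
        # Replace only the rows for this month, so reruns stay idempotent.
        .option("replaceWhere", f"day >= '{start}' AND day < '{end}'")
        .saveAsTable("cor_project.gold.wave_daily_summary")
    )
```

On Databricks this could be called as `rebuild_month(spark, 2026, 4)` after the silver load for April finishes. Because the whole month is bounded and read at once, the last day is aggregated like any other day, with no watermark involved.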