<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hi @Akash_Varuna, The count discrepancies you are seeing... in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/streaming-table-data-leakage-to-historical-permanent-table/m-p/150321#M53357</link>
    <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/153223"&gt;@Akash_Varuna&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;The count discrepancies you are seeing between stream_messages and messages are almost certainly caused by the 24-hour rolling window on your stream_messages table expiring data while the load_messages job is paused during your maintenance window.&lt;/P&gt;
&lt;P&gt;Here is what is happening step by step:&lt;/P&gt;
&lt;P&gt;WHAT IS CAUSING THE DATA LOSS&lt;/P&gt;
&lt;P&gt;1. Your load_messages job is paused for the maintenance window.&lt;BR /&gt;
2. While paused, the stream_messages table continues its 24-hour rolling window lifecycle. Events that arrived more than 24 hours ago are removed from stream_messages.&lt;BR /&gt;
3. When load_messages resumes, it picks up from the last tracked technical_sequenceNumber per partition. But some of the events it has not yet processed have already been aged out of stream_messages, so they are no longer available to read.&lt;BR /&gt;
4. Those "missing" events never make it into the messages table, creating the count discrepancy.&lt;/P&gt;
&lt;P&gt;This explains why days without a maintenance pause show matching counts (load_messages keeps up and processes events before they expire), while maintenance window days show gaps.&lt;/P&gt;
&lt;P&gt;The OPTIMIZE operation on the messages table is not the cause. OPTIMIZE on a Delta table does not modify data. It only compacts small files into larger ones, and Delta Lake's snapshot isolation protects both streaming and batch readers during this process. You can safely rule that out.&lt;/P&gt;
&lt;P&gt;RECOMMENDED SOLUTIONS&lt;/P&gt;
&lt;P&gt;Option 1: Extend the stream_messages retention window&lt;/P&gt;
&lt;P&gt;Increase the rolling window on stream_messages so that it is significantly longer than your maximum maintenance window plus catch-up time. In theory a 2-hour pause fits comfortably inside a 24-hour window, but if the effective gap (pause, backlog, and the EXPIRING_TIME delay) approaches or exceeds the retention period, extend the window, for example to 48 or 72 hours. This gives you a safety margin so that no events expire before the resumed job can process them.&lt;/P&gt;
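&lt;P&gt;How you extend the window depends on how it is implemented. If it is a scheduled delete job, the change is just the interval. A minimal sketch, assuming the window is driven by the existing context_updatedAt column (adjust to whatever actually drives your retention):&lt;/P&gt;
&lt;PRE&gt;-- hypothetical retention job; the column and interval are assumptions
DELETE FROM stream_messages
WHERE context_updatedAt &amp;lt; current_timestamp() - INTERVAL 72 HOURS&lt;/PRE&gt;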
&lt;P&gt;Option 2: Avoid pausing load_messages during maintenance&lt;/P&gt;
&lt;P&gt;Since OPTIMIZE on the messages table is safe for concurrent readers and writers (Delta Lake guarantees this via MVCC), you do not need to pause load_messages during the optimization. You can run OPTIMIZE on the messages table while load_messages continues to operate. The Structured Streaming job will not be affected.&lt;/P&gt;
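&lt;P&gt;For example, this can run on its own schedule while load_messages keeps writing:&lt;/P&gt;
&lt;PRE&gt;-- safe to run concurrently with the streaming job; readers and writers are not blocked
OPTIMIZE messages;
-- optionally co-locate a commonly filtered column (the column choice here is an assumption):
-- OPTIMIZE messages ZORDER BY (context_updatedAt);&lt;/PRE&gt;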
&lt;P&gt;From the Databricks documentation on OPTIMIZE:&lt;BR /&gt;
"Performing OPTIMIZE on a table that is a streaming source does not affect any current or future streams that treat this table as a source. Readers of Delta tables use snapshot isolation, which means that they are not interrupted when OPTIMIZE removes unnecessary files from the transaction log."&lt;/P&gt;
&lt;P&gt;&lt;A href="https://docs.databricks.com/en/delta/optimize.html" target="_blank"&gt;https://docs.databricks.com/en/delta/optimize.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;If you are also running VACUUM on the messages table, that is also safe for concurrent operations as long as the retention period is set appropriately (default 7 days). There is no need to pause writers.&lt;/P&gt;
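&lt;P&gt;For completeness, VACUUM only removes files that are already outside the retention threshold, so readers of recent snapshots are unaffected:&lt;/P&gt;
&lt;PRE&gt;VACUUM messages;  -- uses the table's delta.deletedFileRetentionDuration (default 7 days)&lt;/PRE&gt;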
&lt;P&gt;Option 3: Switch to a true streaming read instead of batch read from stream_messages&lt;/P&gt;
&lt;P&gt;Your current pattern reads from stream_messages using spark.read.table() (batch mode) inside foreachBatch, with manual sequence number tracking. A more resilient approach would be to use spark.readStream from the upstream source (Event Hubs) directly, letting Spark Structured Streaming manage checkpoints and offsets automatically. This gives you:&lt;/P&gt;
&lt;P&gt;- Automatic offset tracking via checkpoints (no manual sequence number management)&lt;BR /&gt;
- Exactly-once processing guarantees to the Delta sink (when writing to Delta with foreachBatch and using idempotent writes, or using a direct Delta writeStream)&lt;BR /&gt;
- No dependency on the stream_messages rolling window&lt;/P&gt;
&lt;P&gt;This would look something like:&lt;/P&gt;
&lt;PRE&gt;(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "&amp;lt;your-event-hubs-endpoint&amp;gt;:9093")
    .option("subscribe", "&amp;lt;your-topic&amp;gt;")
    # the Event Hubs Kafka endpoint requires SASL_SSL, with the connection string as the password
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config",
            'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="&amp;lt;your-connection-string&amp;gt;";')
    .option("startingOffsets", "earliest")
    .option("failOnDataLoss", "false")  # note: suppresses errors when offsets have already expired
    .load()
    # the Kafka value column is binary; cast or parse it before persisting
    .selectExpr("CAST(value AS STRING) AS body", "timestamp")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "&amp;lt;checkpoint-path&amp;gt;")
    .trigger(availableNow=True)
    .toTable("messages")
)&lt;/PRE&gt;
&lt;P&gt;With this approach, the streaming checkpoint tracks exactly which offsets have been consumed. Even if the job is paused and restarted, it resumes from the last committed checkpoint offset, and there is no dependency on the stream_messages intermediate table retaining the data.&lt;/P&gt;
&lt;P&gt;For Event Hubs specifically, make sure your Event Hubs retention period is longer than your maximum expected downtime so that offsets remain available when the job restarts. Azure Event Hubs retention is configurable from 1 to 90 days (or unlimited with the Premium/Dedicated tier).&lt;/P&gt;
&lt;P&gt;&lt;A href="https://docs.databricks.com/en/connect/streaming/kafka.html" target="_blank"&gt;https://docs.databricks.com/en/connect/streaming/kafka.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Option 4: Add a safety buffer to your EXPIRING_TIME filter&lt;/P&gt;
&lt;P&gt;If you keep the current architecture, consider making the EXPIRING_TIME filter aware of the maintenance window. Instead of a fixed 30-minute delay, you could track the last successful run timestamp and use it to ensure you process all records that arrived since the last run, regardless of how long the pause was.&lt;/P&gt;
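&lt;P&gt;A minimal sketch of that idea, assuming a hypothetical helper that persists the timestamp of the last successful run (for example in a small audit table):&lt;/P&gt;
&lt;PRE&gt;from datetime import timedelta

EXPIRING_TIME = timedelta(minutes=30)

last_success = load_last_successful_run_ts()  # assumed helper, e.g. read from an audit table

# process everything that arrived since the last successful run, keeping the
# usual 30-minute settling buffer; 'now' as in the existing job, i.e.
# max(context_updatedAt) of the current micro-batch
delayed_stream_increment = (
    spark.read.table(input_stream_table_name)
    .where(F.col('context_updatedAt') &amp;gt;= F.lit(last_success - EXPIRING_TIME))
    .where(F.col('context_updatedAt') &amp;lt; F.lit(now - EXPIRING_TIME))
)&lt;/PRE&gt;
&lt;P&gt;Combined with the existing per-partition sequence filter, this makes the lower bound survive arbitrarily long pauses instead of relying on the job never falling behind.&lt;/P&gt;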
&lt;P&gt;SUMMARY&lt;/P&gt;
&lt;P&gt;The most impactful fix is Option 2: simply stop pausing load_messages during the OPTIMIZE maintenance window. Delta Lake handles concurrent OPTIMIZE and writes safely. If you do need to pause, extend the stream_messages retention (Option 1) to cover the gap.&lt;/P&gt;
&lt;P&gt;For a longer-term improvement, Option 3 (reading directly from Event Hubs with checkpoint-based offset tracking) eliminates the rolling window dependency entirely and gives you the most robust pipeline.&lt;/P&gt;
&lt;P&gt;* This reply was drafted with an agent system I built, which researches and writes responses from the documentation I have available and from previous memory. I personally review each draft for obvious issues, monitor the system's reliability, and correct it when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.&lt;/P&gt;
&lt;P&gt;If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.&lt;/P&gt;</description>
    <pubDate>Mon, 09 Mar 2026 05:04:20 GMT</pubDate>
    <dc:creator>SteveOstrowski</dc:creator>
    <dc:date>2026-03-09T05:04:20Z</dc:date>
    <item>
      <title>Streaming Table data leakage to historical permanent table</title>
      <link>https://community.databricks.com/t5/data-engineering/streaming-table-data-leakage-to-historical-permanent-table/m-p/148694#M52949</link>
      <description>&lt;H1&gt;Data Leakage in Historical Table from Streaming Table&lt;/H1&gt;&lt;H2&gt;Environment&lt;/H2&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Platform:&lt;/STRONG&gt; Azure Databricks + Azure Event Hubs&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Streaming Framework:&lt;/STRONG&gt; Spark Structured Streaming&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Storage:&lt;/STRONG&gt; Delta Lake&lt;/LI&gt;&lt;/UL&gt;&lt;HR /&gt;&lt;H2&gt;Pipeline&lt;/H2&gt;&lt;PRE&gt;Event Hubs → stream_messages (live 24hr rolling window) → messages&lt;/PRE&gt;&lt;P&gt;The load_messages notebook reads from stream_messages as a &lt;STRONG&gt;static batch inside foreachBatch&lt;/STRONG&gt;, manually tracking its position using technical_sequenceNumber per technical_partition, stored in last_seq_nums_per_partition. It applies an EXPIRING_TIME filter of 30 minutes, where now is derived from max(context_updatedAt) of the current micro-batch:&lt;/P&gt;&lt;PRE&gt;now = microbatch_df.agg(F.max('context_updatedAt')).collect()[0][0]

delayed_stream_increment = (
    spark.read.table(input_stream_table_name)
    .where(
        reduce(
            lambda x, y: x | y,
            [
                (F.col('technical_partition') == partition) &amp;amp; (F.col('technical_sequenceNumber') &amp;gt; last_record_number)
                for partition, last_record_number in last_seq_nums_per_partition.items()
            ]
        )
    )
    .where(F.col('context_updatedAt') &amp;lt; now - EXPIRING_TIME)
)&lt;/PRE&gt;&lt;HR /&gt;&lt;H2&gt;The Problem&lt;/H2&gt;&lt;P&gt;There is a scheduled maintenance window for &lt;STRONG&gt;optimizing the messages table&lt;/STRONG&gt;. During this window the load_messages job is paused. When comparing counts between stream_messages and messages for the affected dates, we see discrepancies on some days and none on others.&lt;/P&gt;&lt;P&gt;For example, today (2026-02-17) stream_messages and messages have the &lt;STRONG&gt;same count&lt;/STRONG&gt; — confirming the leakage is tied specifically to the optimization/maintenance window dates.&lt;BR /&gt;&lt;BR /&gt;I have looked into this and am not sure whether the optimize task causes the leakage because a micro-batch is dropped.&lt;BR /&gt;&lt;BR /&gt;If you have any idea of the possible reason for this in the current setup, or of alternatives to avoid it, please let me know.&lt;BR /&gt;&lt;BR /&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Wed, 18 Feb 2026 13:01:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/streaming-table-data-leakage-to-historical-permanent-table/m-p/148694#M52949</guid>
      <dc:creator>Akash_Varuna</dc:creator>
      <dc:date>2026-02-18T13:01:48Z</dc:date>
    </item>
    <item>
      <title>Hi @Akash_Varuna, The count discrepancies you are seeing...</title>
      <link>https://community.databricks.com/t5/data-engineering/streaming-table-data-leakage-to-historical-permanent-table/m-p/150321#M53357</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/153223"&gt;@Akash_Varuna&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;The count discrepancies you are seeing between stream_messages and messages are almost certainly caused by the 24-hour rolling window on your stream_messages table expiring data while the load_messages job is paused during your maintenance window.&lt;/P&gt;
&lt;P&gt;Here is what is happening step by step:&lt;/P&gt;
&lt;P&gt;WHAT IS CAUSING THE DATA LOSS&lt;/P&gt;
&lt;P&gt;1. Your load_messages job is paused for the maintenance window.&lt;BR /&gt;
2. While paused, the stream_messages table continues its 24-hour rolling window lifecycle. Events that arrived more than 24 hours ago are removed from stream_messages.&lt;BR /&gt;
3. When load_messages resumes, it picks up from the last tracked technical_sequenceNumber per partition. But some of the events it has not yet processed have already been aged out of stream_messages, so they are no longer available to read.&lt;BR /&gt;
4. Those "missing" events never make it into the messages table, creating the count discrepancy.&lt;/P&gt;
&lt;P&gt;This explains why days without a maintenance pause show matching counts (load_messages keeps up and processes events before they expire), while maintenance window days show gaps.&lt;/P&gt;
&lt;P&gt;The OPTIMIZE operation on the messages table is not the cause. OPTIMIZE on a Delta table does not modify data. It only compacts small files into larger ones, and Delta Lake's snapshot isolation protects both streaming and batch readers during this process. You can safely rule that out.&lt;/P&gt;
&lt;P&gt;RECOMMENDED SOLUTIONS&lt;/P&gt;
&lt;P&gt;Option 1: Extend the stream_messages retention window&lt;/P&gt;
&lt;P&gt;Increase the rolling window on stream_messages so that it is significantly longer than your maximum maintenance window plus catch-up time. In theory a 2-hour pause fits comfortably inside a 24-hour window, but if the effective gap (pause, backlog, and the EXPIRING_TIME delay) approaches or exceeds the retention period, extend the window, for example to 48 or 72 hours. This gives you a safety margin so that no events expire before the resumed job can process them.&lt;/P&gt;
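&lt;P&gt;How you extend the window depends on how it is implemented. If it is a scheduled delete job, the change is just the interval. A minimal sketch, assuming the window is driven by the existing context_updatedAt column (adjust to whatever actually drives your retention):&lt;/P&gt;
&lt;PRE&gt;-- hypothetical retention job; the column and interval are assumptions
DELETE FROM stream_messages
WHERE context_updatedAt &amp;lt; current_timestamp() - INTERVAL 72 HOURS&lt;/PRE&gt;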
&lt;P&gt;Option 2: Avoid pausing load_messages during maintenance&lt;/P&gt;
&lt;P&gt;Since OPTIMIZE on the messages table is safe for concurrent readers and writers (Delta Lake guarantees this via MVCC), you do not need to pause load_messages during the optimization. You can run OPTIMIZE on the messages table while load_messages continues to operate. The Structured Streaming job will not be affected.&lt;/P&gt;
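&lt;P&gt;For example, this can run on its own schedule while load_messages keeps writing:&lt;/P&gt;
&lt;PRE&gt;-- safe to run concurrently with the streaming job; readers and writers are not blocked
OPTIMIZE messages;
-- optionally co-locate a commonly filtered column (the column choice here is an assumption):
-- OPTIMIZE messages ZORDER BY (context_updatedAt);&lt;/PRE&gt;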
&lt;P&gt;From the Databricks documentation on OPTIMIZE:&lt;BR /&gt;
"Performing OPTIMIZE on a table that is a streaming source does not affect any current or future streams that treat this table as a source. Readers of Delta tables use snapshot isolation, which means that they are not interrupted when OPTIMIZE removes unnecessary files from the transaction log."&lt;/P&gt;
&lt;P&gt;&lt;A href="https://docs.databricks.com/en/delta/optimize.html" target="_blank"&gt;https://docs.databricks.com/en/delta/optimize.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;If you are also running VACUUM on the messages table, that is also safe for concurrent operations as long as the retention period is set appropriately (default 7 days). There is no need to pause writers.&lt;/P&gt;
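&lt;P&gt;For completeness, VACUUM only removes files that are already outside the retention threshold, so readers of recent snapshots are unaffected:&lt;/P&gt;
&lt;PRE&gt;VACUUM messages;  -- uses the table's delta.deletedFileRetentionDuration (default 7 days)&lt;/PRE&gt;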
&lt;P&gt;Option 3: Switch to a true streaming read instead of batch read from stream_messages&lt;/P&gt;
&lt;P&gt;Your current pattern reads from stream_messages using spark.read.table() (batch mode) inside foreachBatch, with manual sequence number tracking. A more resilient approach would be to use spark.readStream from the upstream source (Event Hubs) directly, letting Spark Structured Streaming manage checkpoints and offsets automatically. This gives you:&lt;/P&gt;
&lt;P&gt;- Automatic offset tracking via checkpoints (no manual sequence number management)&lt;BR /&gt;
- Exactly-once processing guarantees to the Delta sink (when writing to Delta with foreachBatch and using idempotent writes, or using a direct Delta writeStream)&lt;BR /&gt;
- No dependency on the stream_messages rolling window&lt;/P&gt;
&lt;P&gt;This would look something like:&lt;/P&gt;
&lt;PRE&gt;(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "&amp;lt;your-event-hubs-endpoint&amp;gt;:9093")
    .option("subscribe", "&amp;lt;your-topic&amp;gt;")
    # the Event Hubs Kafka endpoint requires SASL_SSL, with the connection string as the password
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config",
            'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="&amp;lt;your-connection-string&amp;gt;";')
    .option("startingOffsets", "earliest")
    .option("failOnDataLoss", "false")  # note: suppresses errors when offsets have already expired
    .load()
    # the Kafka value column is binary; cast or parse it before persisting
    .selectExpr("CAST(value AS STRING) AS body", "timestamp")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "&amp;lt;checkpoint-path&amp;gt;")
    .trigger(availableNow=True)
    .toTable("messages")
)&lt;/PRE&gt;
&lt;P&gt;With this approach, the streaming checkpoint tracks exactly which offsets have been consumed. Even if the job is paused and restarted, it resumes from the last committed checkpoint offset, and there is no dependency on the stream_messages intermediate table retaining the data.&lt;/P&gt;
&lt;P&gt;For Event Hubs specifically, make sure your Event Hubs retention period is longer than your maximum expected downtime so that offsets remain available when the job restarts. Azure Event Hubs retention is configurable from 1 to 90 days (or unlimited with the Premium/Dedicated tier).&lt;/P&gt;
&lt;P&gt;&lt;A href="https://docs.databricks.com/en/connect/streaming/kafka.html" target="_blank"&gt;https://docs.databricks.com/en/connect/streaming/kafka.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Option 4: Add a safety buffer to your EXPIRING_TIME filter&lt;/P&gt;
&lt;P&gt;If you keep the current architecture, consider making the EXPIRING_TIME filter aware of the maintenance window. Instead of a fixed 30-minute delay, you could track the last successful run timestamp and use it to ensure you process all records that arrived since the last run, regardless of how long the pause was.&lt;/P&gt;
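&lt;P&gt;A minimal sketch of that idea, assuming a hypothetical helper that persists the timestamp of the last successful run (for example in a small audit table):&lt;/P&gt;
&lt;PRE&gt;from datetime import timedelta

EXPIRING_TIME = timedelta(minutes=30)

last_success = load_last_successful_run_ts()  # assumed helper, e.g. read from an audit table

# process everything that arrived since the last successful run, keeping the
# usual 30-minute settling buffer; 'now' as in the existing job, i.e.
# max(context_updatedAt) of the current micro-batch
delayed_stream_increment = (
    spark.read.table(input_stream_table_name)
    .where(F.col('context_updatedAt') &amp;gt;= F.lit(last_success - EXPIRING_TIME))
    .where(F.col('context_updatedAt') &amp;lt; F.lit(now - EXPIRING_TIME))
)&lt;/PRE&gt;
&lt;P&gt;Combined with the existing per-partition sequence filter, this makes the lower bound survive arbitrarily long pauses instead of relying on the job never falling behind.&lt;/P&gt;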
&lt;P&gt;SUMMARY&lt;/P&gt;
&lt;P&gt;The most impactful fix is Option 2: simply stop pausing load_messages during the OPTIMIZE maintenance window. Delta Lake handles concurrent OPTIMIZE and writes safely. If you do need to pause, extend the stream_messages retention (Option 1) to cover the gap.&lt;/P&gt;
&lt;P&gt;For a longer-term improvement, Option 3 (reading directly from Event Hubs with checkpoint-based offset tracking) eliminates the rolling window dependency entirely and gives you the most robust pipeline.&lt;/P&gt;
&lt;P&gt;* This reply was drafted with an agent system I built, which researches and writes responses from the documentation I have available and from previous memory. I personally review each draft for obvious issues, monitor the system's reliability, and correct it when I detect drift, but there is still a small chance something is inaccurate, especially if you are experimenting with brand-new features.&lt;/P&gt;
&lt;P&gt;If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.&lt;/P&gt;</description>
      <pubDate>Mon, 09 Mar 2026 05:04:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/streaming-table-data-leakage-to-historical-permanent-table/m-p/150321#M53357</guid>
      <dc:creator>SteveOstrowski</dc:creator>
      <dc:date>2026-03-09T05:04:20Z</dc:date>
    </item>
  </channel>
</rss>

