
How to push late data to DLQ in pyspark structured streaming ?

PreetiB — Mon, 28 Apr 2025 12:13:12 GMT

Re: How to push late data to DLQ in pyspark structured streaming ?

lingareddy_Alva — Mon, 28 Apr 2025 17:00:24 GMT

Pushing late-arriving data into a Dead Letter Queue (DLQ) is a common pattern in PySpark Structured Streaming, especially in real-time pipelines.

1. Set watermarking on your stream to define "late data."

Example:
df_with_event_time = df.withWatermark("event_time_column", "5 minutes")

2. Split the stream into:
On-time data
Late data
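You can approximate this manually by comparing each event's time against the current processing time minus the delay threshold. Note that this is not the real watermark, which Spark derives from the maximum event time it has seen rather than from the wall clock.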
Example:
from pyspark.sql import functions as F

WATERMARK_DELAY_MINUTES = 5

current_time = F.current_timestamp()

# Create a flag to identify late data
df_with_flags = df_with_event_time.withColumn(
    "is_late",
    F.when(
        F.col("event_time_column") < (current_time - F.expr(f"INTERVAL {WATERMARK_DELAY_MINUTES} minutes")),
        F.lit(True)
    ).otherwise(F.lit(False))
)

3. Filter into two DataFrames:
# Good data
df_on_time = df_with_flags.filter(~F.col("is_late"))
# Late data to push to DLQ
df_late = df_with_flags.filter(F.col("is_late"))

4. Write late data to DLQ (example: to a separate Delta table, or a blob location):
Example:
dlq_path = "/mnt/dlq/late_data/"

late_query = (
    df_late.writeStream
    .format("delta")  # or "parquet", "json"
    .option("checkpointLocation", "/mnt/checkpoints/late_data/")
    .outputMode("append")
    .start(dlq_path)
)

✅ Now late data is saved safely for further analysis without poisoning your main pipeline.

5. Write good data normally to your trusted sink:
Example:
good_query = (
    df_on_time.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/good_data/")
    .outputMode("append")
    .start("/mnt/good_data/")
)

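If you only want too-late data dropped silently, no extra option is needed: once a watermark is set, stateful operators (windowed aggregations, dropDuplicates, stream-stream joins) discard records older than the watermark on their own.
Stateless queries, such as the plain filters above, do not drop anything because of the watermark.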

Re: How to push late data to DLQ in pyspark structured streaming ?

Louis_Frolio — Mon, 28 Apr 2025 17:52:54 GMT

Windowing/Watermarking is your friend here 🙂

Re: How to push late data to DLQ in pyspark structured streaming ?

PreetiB — Wed, 30 Apr 2025 16:25:06 GMT

Filtering late data based on the current timestamp is not acceptable, because the watermark is computed as the maximum event timestamp seen so far minus the delay threshold, not from the wall clock.

So my objective is to identify late data based on the actual watermark value.

Currently I am struggling to get the watermark value inside the application so that I can filter late data.

We can get the watermark value from a StreamingQueryListener object, but we cannot do heavy processing such as filtering inside the listener, as it causes out-of-memory errors.
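To make the watermark rule above concrete, here is a minimal pure-Python simulation (no Spark involved). It assumes the usual micro-batch behaviour: the watermark applied to a batch is the maximum event time seen in earlier batches minus the delay. The function name `split_by_watermark` is hypothetical, and the boundary handling (`<` versus `<=`) is simplified.

```python
from datetime import datetime, timedelta


def split_by_watermark(batches, delay):
    """Simulate event-time watermarking over a sequence of micro-batches.

    Returns one (watermark, on_time, late) tuple per batch, where `watermark`
    is the value in effect while that batch was processed.
    """
    max_seen = None    # largest event time observed so far
    watermark = None   # no watermark exists before the first batch completes
    results = []
    for batch in batches:
        on_time = [t for t in batch if watermark is None or t >= watermark]
        late = [t for t in batch if watermark is not None and t < watermark]
        results.append((watermark, on_time, late))
        # The watermark only advances after the batch, based on event times,
        # never on the wall clock.
        if batch:
            batch_max = max(batch)
            max_seen = batch_max if max_seen is None else max(max_seen, batch_max)
            watermark = max_seen - delay
    return results


base = datetime(2025, 4, 28, 12, 0)
minutes = lambda n: timedelta(minutes=n)
batches = [
    [base, base + minutes(10)],               # batch 0: sets max event time to 12:10
    [base + minutes(2), base + minutes(9)],   # batch 1: 12:02 is behind the watermark
]
results = split_by_watermark(batches, delay=minutes(5))
```

In this run the second batch is processed with a watermark of 12:05, so the 12:02 event counts as late no matter what the processing-time clock says, which is why a `current_timestamp()` comparison can classify events differently.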

Re: How to push late data to DLQ in pyspark structured streaming ?

PreetiB — Wed, 30 Apr 2025 16:27:44 GMT

Thanks for your response. I also want to add the problem below, which I'm currently facing.

Re: How to push late data to DLQ in pyspark structured streaming ?

lingareddy_Alva — Wed, 30 Apr 2025 20:36:06 GMT

@PreetiB

Watermarking in Spark Structured Streaming defines how late an event can arrive before it's considered "too late,"
but it doesn't directly expose the current watermark value to use within your main processing logic.
This makes it tricky when you want to filter late data based on the watermark value.

You're also correct that using StreamingQueryListener to capture the watermark value and then doing heavy processing in
it is problematic due to memory and architecture limitations (listeners are for monitoring, not processing).

Because watermarking is handled internally by Spark, there’s no clean public API to pull the watermark into transformations directly.
A workaround is to publish the watermark through a side channel (temp view, broadcast, external KV store), which is the safest way to make that value accessible within your transformation logic.
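One way the side-channel idea could look is sketched below; treat it as an assumption-laden outline, not a tested recipe. The listener only copies the reported watermark into a small driver-side holder, and the actual splitting happens in `foreachBatch`. `WatermarkHolder`, `make_listener`, and `make_batch_splitter` are hypothetical helper names. The sketch assumes PySpark 3.4+ (where `StreamingQueryListener` is available in Python), that the listener and the `foreachBatch` function run in the same driver process, and that the query's progress reports a `watermark` entry under `eventTime`. Listener events arrive asynchronously, so the value read in a batch may lag the watermark Spark is using internally.

```python
import threading
from datetime import datetime, timezone
from typing import Optional


class WatermarkHolder:
    """Thread-safe, driver-side holder for the latest reported watermark."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._watermark: Optional[datetime] = None

    def update_from_progress(self, event_time: Optional[dict]) -> None:
        # Progress reports carry the watermark as a UTC string such as
        # "2025-04-28T12:00:00.000Z"; ignore reports that have none.
        raw = event_time.get("watermark") if event_time else None
        if raw is None:
            return
        parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ")
        with self._lock:
            self._watermark = parsed.replace(tzinfo=timezone.utc)

    def get(self) -> Optional[datetime]:
        with self._lock:
            return self._watermark


def make_listener(holder: WatermarkHolder):
    # Imported lazily so the holder can be unit-tested without Spark.
    from pyspark.sql.streaming import StreamingQueryListener

    class WatermarkListener(StreamingQueryListener):
        def onQueryStarted(self, event):
            pass

        def onQueryProgress(self, event):
            # Cheap bookkeeping only: no filtering or writes in the listener.
            holder.update_from_progress(event.progress.eventTime)

        def onQueryIdle(self, event):
            pass

        def onQueryTerminated(self, event):
            pass

    return WatermarkListener()


def make_batch_splitter(holder, good_path, dlq_path, event_col="event_time_column"):
    from pyspark.sql import functions as F

    def split_batch(batch_df, batch_id):
        watermark = holder.get()
        if watermark is None:
            # No watermark reported yet: treat the whole batch as on time.
            batch_df.write.format("delta").mode("append").save(good_path)
            return
        is_late = F.col(event_col) < F.lit(watermark.isoformat()).cast("timestamp")
        batch_df.persist()  # the batch is scanned twice below
        try:
            batch_df.filter(is_late).write.format("delta").mode("append").save(dlq_path)
            batch_df.filter(~is_late).write.format("delta").mode("append").save(good_path)
        finally:
            batch_df.unpersist()

    return split_batch


holder = WatermarkHolder()
holder.update_from_progress({"watermark": "2025-04-28T12:00:00.000Z"})
```

Wiring it up would then be roughly `spark.streams.addListener(make_listener(holder))` followed by `df_with_event_time.writeStream.foreachBatch(make_batch_splitter(holder, "/mnt/good_data/", "/mnt/dlq/late_data/")).option("checkpointLocation", ...).start()`. If the listener runs in a different process from the batch function, the in-memory holder will not be shared and an external store would be needed instead.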

Re: How to push late data to DLQ in pyspark structured streaming ?

PreetiB — Sat, 03 May 2025 07:19:55 GMT

Hi LRALVA,

Can you please describe the workaround in detail so that I can implement it?