Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.

Why does readStream filter go through all records?

Jennifer
New Contributor III

Hello,

I am running a Spark Structured Streaming job that reads from one table, table_1, does some aggregation, and then writes the results to another table. table_1 is partitioned by ["datehour", "customerID"].

My code is like this:

from pyspark.sql.functions import col

(spark.readStream
    .format("delta")
    .table("table_1")
    .withWatermark("datehour", "24 hours")
    .filter(col("datehour") >= "2023-11-27")
    ....
I run the workflow as a job with tasks.
But the filtering doesn't work as I expected. The stream scans all the rows, which are two years of events, before it finds the events that match the filter. How can I make the stream start directly from the "datehour" that the filter specifies?
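This happens because a Delta streaming source replays the table's commit history from the beginning, and the filter is applied only after each micro-batch is read, so it does not prune old data. A hedged sketch of one way to skip the backlog, using Delta Lake's documented startingTimestamp option on the streaming reader (table and column names are taken from the question; this is a sketch, not necessarily the accepted answer):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = (
    spark.readStream
    .format("delta")
    # Start streaming from table versions committed at or after this time,
    # instead of replaying the full history (startingVersion is the
    # version-number equivalent).
    .option("startingTimestamp", "2023-11-27")
    .table("table_1")
    .withWatermark("datehour", "24 hours")
    # The filter is still useful: startingTimestamp selects by commit time,
    # not by the event-time values in the "datehour" column.
    .filter(col("datehour") >= "2023-11-27")
)
```

Note that startingTimestamp takes effect only when the stream starts from a fresh checkpoint; an existing checkpoint's offsets take precedence.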
1 ACCEPTED SOLUTION

-werners-
Esteemed Contributor III

Jennifer
New Contributor III

Thanks:-)
