Hello,
I am running a Spark Structured Streaming job that reads from one table (table_1), does some aggregation, and then writes the results to another table. table_1 is partitioned by ["datehour", "customerID"].
My code is like this:
from pyspark.sql.functions import col

df = (
    spark.readStream
        .format("delta")
        .table("table_1")
        .withWatermark("datehour", "24 hours")
        .filter(col("datehour") >= "2023-11-27")
)
....
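Roughly, the elided part does something like the sketch below (the aggregation, target table name, and checkpoint path are simplified placeholders, not my real ones): it aggregates per customerID within each datehour and appends the results to another Delta table.

from pyspark.sql.functions import count

# Aggregate events per customerID within each datehour (placeholder aggregation)
agg = (
    df.groupBy("datehour", "customerID")
      .agg(count("*").alias("event_count"))
)

# Append the aggregated results to another Delta table (placeholder names/paths)
(
    agg.writeStream
       .format("delta")
       .outputMode("append")
       .option("checkpointLocation", "/path/to/checkpoint")  # placeholder path
       .toTable("table_2")  # placeholder target table; toTable() starts the query
)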
I run the workflow as a job with tasks.
But the filtering doesn't work as I expected. The stream scans all the rows, covering two years of events, before it finds the events that match the filter. How can I make the stream start directly from the "datehour" that the filter specifies?