cancel
Showing results for 
Search instead for 
Did you mean: 
Warehousing & Analytics
cancel
Showing results for 
Search instead for 
Did you mean: 

Why does readStream filter go through all records?

Jennifer
New Contributor III

Hello,

I am running spark structured streaming, reading from one table table_1, do some aggregation and then write results to another table. table_1 is partitioned by ["datehour", "customerID"]

My code is like this:

spark.readStream
.format("delta")
.table("table_1")
.withWatermark("datehour", "24 hours")
.filter((col("datehour")>="2023-11-27"))
....
I define run the workflow by job with tasks.
But the filtering doesn't works as I expected. The streaming goes through all the rows, which are two years of events before it could find the events that meet the filter. How can I let the streaming start directly from the "datehour" where the filter speicifies?
1 ACCEPTED SOLUTION

Accepted Solutions
2 REPLIES 2

-werners-
Esteemed Contributor III

Jennifer
New Contributor III

Thanks:-)