Warehousing & Analytics
Why does readStream filter go through all records?

Jennifer
New Contributor III

Hello,

I am running Spark Structured Streaming: reading from one table, table_1, doing some aggregation, and then writing the results to another table. table_1 is partitioned by ["datehour", "customerID"].

My code is like this:

from pyspark.sql.functions import col

df = (
    spark.readStream
        .format("delta")
        .table("table_1")
        .withWatermark("datehour", "24 hours")
        .filter(col("datehour") >= "2023-11-27")
)
....
I run the workflow as a job with tasks.
But the filtering doesn't work as I expected. The stream goes through all the rows (two years of events) before it finds the events that match the filter. How can I make the stream start directly from the "datehour" that the filter specifies?
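For context: a streaming Delta source replays the table's transaction log from the beginning by default, so a .filter() on a partition column is applied only after the historical files have already been read. One documented way to skip the history is the Delta reader option startingTimestamp (or startingVersion). This is a hedged sketch, not necessarily what the accepted reply proposed; the timestamp value is illustrative, and an active SparkSession named spark plus the Delta table table_1 are assumed.

```python
from pyspark.sql.functions import col

# "startingTimestamp" tells the Delta streaming source to begin from
# commits at or after the given timestamp, instead of replaying the
# table's full history.
df = (
    spark.readStream
        .format("delta")
        .option("startingTimestamp", "2023-11-27")  # illustrative value
        .table("table_1")
        .withWatermark("datehour", "24 hours")
        # Keep the filter as a correctness guard: startingTimestamp is
        # based on commit time in the Delta log, not on the "datehour"
        # column, so late-arriving old records still need to be filtered.
        .filter(col("datehour") >= "2023-11-27")
)
```

Note that commit time only lines up with "datehour" when data is written roughly in event-time order; the retained filter handles any mismatch.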
1 ACCEPTED SOLUTION


-werners-
Esteemed Contributor III
2 REPLIES

-werners-
Esteemed Contributor III

Jennifer
New Contributor III

Thanks:-)
