Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.

Why does readStream filter go through all records?

Jennifer
New Contributor III

Hello,

I am running a Spark Structured Streaming job that reads from one table, table_1, does some aggregation, and then writes the results to another table. table_1 is partitioned by ["datehour", "customerID"].

My code is like this:

from pyspark.sql.functions import col

(spark.readStream
    .format("delta")
    .table("table_1")
    .withWatermark("datehour", "24 hours")
    .filter(col("datehour") >= "2023-11-27")
    ....
I run the workflow as a job with tasks.
But the filtering doesn't work as I expected. The stream scans all the rows, which are two years of events, before it finds the events that match the filter. How can I make the stream start directly from the "datehour" that the filter specifies?
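This happens because a Delta streaming source replays the table's commit history from the beginning, and the filter is applied only after each micro-batch is read, so it does not prune old data. A hedged sketch of one way to skip the backlog, using Delta Lake's documented startingTimestamp option on the streaming reader (table and column names are taken from the question; this is a sketch, not necessarily the accepted answer):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = (
    spark.readStream
    .format("delta")
    # Start streaming from table versions committed at or after this time,
    # instead of replaying the full history (startingVersion is the
    # version-number equivalent).
    .option("startingTimestamp", "2023-11-27")
    .table("table_1")
    .withWatermark("datehour", "24 hours")
    # The filter is still useful: startingTimestamp selects by commit time,
    # not by the event-time values in the "datehour" column.
    .filter(col("datehour") >= "2023-11-27")
)
```

Note that startingTimestamp takes effect only when the stream starts from a fresh checkpoint; an existing checkpoint's offsets take precedence.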
1 ACCEPTED SOLUTION

-werners-
Esteemed Contributor III

Jennifer
New Contributor III

Thanks:-)
