<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Why does readStream filter go through all records? in Warehousing &amp; Analytics</title>
    <link>https://community.databricks.com/t5/warehousing-analytics/why-does-readstream-filter-go-through-all-records/m-p/54193#M1087</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am running Spark Structured Streaming, reading from one table, &lt;SPAN&gt;table_1&lt;/SPAN&gt;, doing some aggregation, and then writing the results to another table. table_1 is partitioned by ["datehour", "customerID"].&lt;/P&gt;&lt;P&gt;My code is like this:&lt;/P&gt;&lt;PRE&gt;spark.readStream
.format("delta")
.table("table_1")
.withWatermark("datehour", "24 hours")
.filter((col("datehour") &amp;gt;= "2023-11-27"))
....&lt;/PRE&gt;&lt;P&gt;I run the workflow as a job with tasks.&lt;/P&gt;&lt;P&gt;But the filtering doesn't work as I expected: the stream goes through all the rows (two years of events) before it finds the events that match the filter. How can I make the stream start directly from the "datehour" that the filter specifies?&lt;/P&gt;</description>
    <pubDate>Wed, 29 Nov 2023 10:17:16 GMT</pubDate>
    <dc:creator>Jennifer</dc:creator>
    <dc:date>2023-11-29T10:17:16Z</dc:date>
    <item>
      <title>Why does readStream filter go through all records?</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/why-does-readstream-filter-go-through-all-records/m-p/54193#M1087</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am running Spark Structured Streaming, reading from one table, &lt;SPAN&gt;table_1&lt;/SPAN&gt;, doing some aggregation, and then writing the results to another table. table_1 is partitioned by ["datehour", "customerID"].&lt;/P&gt;&lt;P&gt;My code is like this:&lt;/P&gt;&lt;PRE&gt;spark.readStream
.format("delta")
.table("table_1")
.withWatermark("datehour", "24 hours")
.filter((col("datehour") &amp;gt;= "2023-11-27"))
....&lt;/PRE&gt;&lt;P&gt;I run the workflow as a job with tasks.&lt;/P&gt;&lt;P&gt;But the filtering doesn't work as I expected: the stream goes through all the rows (two years of events) before it finds the events that match the filter. How can I make the stream start directly from the "datehour" that the filter specifies?&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2023 10:17:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/why-does-readstream-filter-go-through-all-records/m-p/54193#M1087</guid>
      <dc:creator>Jennifer</dc:creator>
      <dc:date>2023-11-29T10:17:16Z</dc:date>
    </item>
    <item>
      <title>Re: Why does readStream filter go through all records?</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/why-does-readstream-filter-go-through-all-records/m-p/54201#M1088</link>
      <description>&lt;P&gt;To define the initial position, please check this:&lt;/P&gt;&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/delta-lake#specify-initial-position" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/delta-lake#specify-initial-position&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2023 11:24:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/why-does-readstream-filter-go-through-all-records/m-p/54201#M1088</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-11-29T11:24:03Z</dc:date>
    </item>
    <item>
      <title>Re: Why does readStream filter go through all records?</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/why-does-readstream-filter-go-through-all-records/m-p/54202#M1089</link>
      <description>&lt;P&gt;Thanks:-)&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2023 11:38:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/why-does-readstream-filter-go-through-all-records/m-p/54202#M1089</guid>
      <dc:creator>Jennifer</dc:creator>
      <dc:date>2023-11-29T11:38:40Z</dc:date>
    </item>
  </channel>
</rss>