<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Why does readStream filter go through all records? in Warehousing &amp; Analytics</title>
    <link>https://community.databricks.com/t5/warehousing-analytics/why-does-readstream-filter-go-through-all-records/m-p/54193#M1087</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am running Spark Structured Streaming, reading from one table, &lt;SPAN&gt;table_1&lt;/SPAN&gt;, doing some aggregation, and then writing the results to another table. table_1 is partitioned by ["datehour", "customerID"].&lt;/P&gt;&lt;P&gt;My code is like this:&lt;/P&gt;&lt;PRE&gt;spark.readStream
.format("delta")
.table("table_1")
.withWatermark("datehour", "24 hours")
.filter((col("datehour") &amp;gt;= "2023-11-27"))
....&lt;/PRE&gt;&lt;P&gt;I run the workflow as a job with tasks.&lt;/P&gt;&lt;P&gt;But the filtering doesn't work as I expected: the stream goes through all the rows (two years of events) before it finds the events that match the filter. How can I make the stream start directly from the "datehour" that the filter specifies?&lt;/P&gt;</description>
    <pubDate>Wed, 29 Nov 2023 10:17:16 GMT</pubDate>
    <dc:creator>Jennifer</dc:creator>
    <dc:date>2023-11-29T10:17:16Z</dc:date>
    <item>
      <title>Why does readStream filter go through all records?</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/why-does-readstream-filter-go-through-all-records/m-p/54193#M1087</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am running Spark Structured Streaming, reading from one table, &lt;SPAN&gt;table_1&lt;/SPAN&gt;, doing some aggregation, and then writing the results to another table. table_1 is partitioned by ["datehour", "customerID"].&lt;/P&gt;&lt;P&gt;My code is like this:&lt;/P&gt;&lt;PRE&gt;spark.readStream
.format("delta")
.table("table_1")
.withWatermark("datehour", "24 hours")
.filter((col("datehour") &amp;gt;= "2023-11-27"))
....&lt;/PRE&gt;&lt;P&gt;I run the workflow as a job with tasks.&lt;/P&gt;&lt;P&gt;But the filtering doesn't work as I expected: the stream goes through all the rows (two years of events) before it finds the events that match the filter. How can I make the stream start directly from the "datehour" that the filter specifies?&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2023 10:17:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/why-does-readstream-filter-go-through-all-records/m-p/54193#M1087</guid>
      <dc:creator>Jennifer</dc:creator>
      <dc:date>2023-11-29T10:17:16Z</dc:date>
    </item>
    <item>
      <title>Re: Why does readStream filter go through all records?</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/why-does-readstream-filter-go-through-all-records/m-p/54201#M1088</link>
      <description>&lt;P&gt;To define the initial position, please check this:&lt;/P&gt;&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/delta-lake#specify-initial-position" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/structured-streaming/delta-lake#specify-initial-position&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2023 11:24:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/why-does-readstream-filter-go-through-all-records/m-p/54201#M1088</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-11-29T11:24:03Z</dc:date>
    </item>
    <item>
      <title>Re: Why does readStream filter go through all records?</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/why-does-readstream-filter-go-through-all-records/m-p/54202#M1089</link>
      <description>&lt;P&gt;Thanks:-)&lt;/P&gt;</description>
      <pubDate>Wed, 29 Nov 2023 11:38:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/why-does-readstream-filter-go-through-all-records/m-p/54202#M1089</guid>
      <dc:creator>Jennifer</dc:creator>
      <dc:date>2023-11-29T11:38:40Z</dc:date>
    </item>
  </channel>
</rss>