<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: DictionaryFilters Pushdown on Views in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/dictionaryfilters-pushdown-on-views/m-p/125412#M47442</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/175345"&gt;@Steffen&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This happens because you're applying some functions to ts attribute like FLOOR, from_unix_timestamp etc.,&amp;nbsp;which hides the raw ts from Spark's optimizer, so it can’t push down filters.&lt;BR /&gt;&lt;BR /&gt;If you can, try to add additional attribute to your upstream table with below convertion:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;to_timestamp(from_unixtime(FLOOR((ts - 1) / 60) * 60 + 60)) &lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;Then you won't need to use any functions in your view and it should help the optimizer do its job.&lt;/P&gt;</description>
    <pubDate>Wed, 16 Jul 2025 10:00:34 GMT</pubDate>
    <dc:creator>szymon_dybczak</dc:creator>
    <dc:date>2025-07-16T10:00:34Z</dc:date>
    <item>
      <title>DictionaryFilters Pushdown on Views</title>
      <link>https://community.databricks.com/t5/data-engineering/dictionaryfilters-pushdown-on-views/m-p/125397#M47441</link>
      <description>&lt;P&gt;Hello&lt;/P&gt;&lt;P&gt;I have a very simple table with time series data with three columns:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;id (long): unique id of signal&lt;/LI&gt;&lt;LI&gt;ts (unix timestamp): timestamp of the event in unix timestamp format&lt;/LI&gt;&lt;LI&gt;value (double): value of the signal at the given timestamp&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;For every second there is one entry for each signal.&lt;/P&gt;&lt;P&gt;I want to create a view with 1 minute average of the signals but I am having trouble to get the&amp;nbsp;DictionaryFilters to push down the ts filter:&lt;/P&gt;&lt;P&gt;My view is currently defined like this:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;SELECT
  to_timestamp(from_unixtime(FLOOR((ts - 1) / 60) * 60 + 60)) AS ts
  id,
  AVG(value) AS avg
FROM
  measurements
GROUP BY
  id,
  FLOOR((ts - 1) / 60)&lt;/LI-CODE&gt;&lt;P&gt;If I make a select on this view with where filter on signal_id and ts the plan shows that the filter is only pushed down for the id:&lt;BR /&gt;DictionaryFilters: [(id#10256L = 1234)]&lt;/P&gt;&lt;P&gt;But I make a query directly on the base table and make the where filter there is is pushed down:&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;DictionaryFilters: [(id#10917L = 1234), ((ts#10920L &amp;gt;= 1751328000) AND (ts#10920L &amp;lt;= 1751414400))]&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I tried to adapt the view to not use group by, but with no success. Everytime I use some sort of AVG() Over or something else, the ts filter is not pushed down anymore.&lt;BR /&gt;&lt;BR /&gt;Is there a way to create a view where I can set filter with where ts between x and x that will be pushed down to the&amp;nbsp;DictionaryFilters?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thanks&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Steffen&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 16 Jul 2025 07:18:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dictionaryfilters-pushdown-on-views/m-p/125397#M47441</guid>
      <dc:creator>Steffen</dc:creator>
      <dc:date>2025-07-16T07:18:32Z</dc:date>
    </item>
    <item>
      <title>Re: DictionaryFilters Pushdown on Views</title>
      <link>https://community.databricks.com/t5/data-engineering/dictionaryfilters-pushdown-on-views/m-p/125412#M47442</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/175345"&gt;@Steffen&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This happens because you're applying some functions to ts attribute like FLOOR, from_unix_timestamp etc.,&amp;nbsp;which hides the raw ts from Spark's optimizer, so it can’t push down filters.&lt;BR /&gt;&lt;BR /&gt;If you can, try to add additional attribute to your upstream table with below convertion:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;to_timestamp(from_unixtime(FLOOR((ts - 1) / 60) * 60 + 60)) &lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;Then you won't need to use any functions in your view and it should help the optimizer do its job.&lt;/P&gt;</description>
      <pubDate>Wed, 16 Jul 2025 10:00:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dictionaryfilters-pushdown-on-views/m-p/125412#M47442</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-07-16T10:00:34Z</dc:date>
    </item>
    <item>
      <title>Re: DictionaryFilters Pushdown on Views</title>
      <link>https://community.databricks.com/t5/data-engineering/dictionaryfilters-pushdown-on-views/m-p/125419#M47443</link>
      <description>&lt;P&gt;I dont think the to_timestamp() is the function but the group by or more precise the arregate over ts?&lt;/P&gt;&lt;P&gt;Because even if I keep the ts as raw this does not work.&lt;/P&gt;&lt;P&gt;E.G. with this view definition, it still does not push down the ts filer:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;SELECT
    id,
    ts,
    AVG(value) OVER (ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS avg_value,
FROM
    measurements&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 16 Jul 2025 11:00:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dictionaryfilters-pushdown-on-views/m-p/125419#M47443</guid>
      <dc:creator>Steffen</dc:creator>
      <dc:date>2025-07-16T11:00:13Z</dc:date>
    </item>
    <item>
      <title>Re: DictionaryFilters Pushdown on Views</title>
      <link>https://community.databricks.com/t5/data-engineering/dictionaryfilters-pushdown-on-views/m-p/125441#M47446</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/175345"&gt;@Steffen&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;To_timestamp() is a function. Let me show you on example. I recreated your data as you can see in below screen:&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_0-1752672125836.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18226iF9AE5765832D2639/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_0-1752672125836.png" alt="szymon_dybczak_0-1752672125836.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Now, for the sake of example I didn't apply any functions to attibutes and the filter pushdown works as expected:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = spark.sql("""
    SELECT
        ts,
        id,
        AVG(value) AS avg
    FROM raw_measurements  
    GROUP BY
    id,
    ts
""")

df.createOrReplaceTempView("test_view")

query = f"""
SELECT *
FROM test_view
WHERE id = 1 AND ts BETWEEN 1751328004 AND 1751328104
"""
#display(spark.sql(query))
# Show logical/physical plan to inspect pushdown
spark.sql(query).explain(True)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_1-1752672472215.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18229iA6413713BE0D6F77/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_1-1752672472215.png" alt="szymon_dybczak_1-1752672472215.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Now, let's try to apply to_timestamp function to ts attribute:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = spark.sql("""
    SELECT    
        id,    
        to_timestamp(from_unixtime(FLOOR((ts - 1) / 60) * 60 + 60)) as ts,
        AVG(value) AS avg
    FROM raw_measurements  
    GROUP BY
    id,
    ts
""")

df.createOrReplaceTempView("test_view")

query = f"""
SELECT *
FROM test_view
WHERE id = 1 AND ts BETWEEN  '2025-07-01T00:54:00.000+00:00' AND '2025-07-02T00:54:00.000+00:00'
"""
#display(spark.sql(query))
# Show logical/physical plan to inspect pushdown
spark.sql(query).explain(True)&lt;/LI-CODE&gt;&lt;P&gt;As you can see, applying simple function to an attribute can prevent Spark SQL optimizer to kick in:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_2-1752672760266.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18230iEB44FB1B66B5EC34/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_2-1752672760266.png" alt="szymon_dybczak_2-1752672760266.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;Your second example is a bit different though:&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;SELECT
    id,
    ts,
    AVG(value) OVER (ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS avg_value,
FROM
    measurements&lt;/LI-CODE&gt;&lt;P&gt;Here I believe you encounter similar case to one described at below link:&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://issues.apache.org/jira/browse/SPARK-23985" target="_blank" rel="noopener"&gt;[SPARK-23985] predicate push down doesn't work with simple compound partition spec - ASF JIRA&lt;/A&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;SPAN&gt;In short, filters are getting pushed only if they appear in the partitionSpec of window function. So, when you're using it like this:&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = spark.sql("""
    SELECT
        id,
        ts,
        AVG(value) OVER (ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS avg_value
    FROM
        raw_measurements
""")

df.createOrReplaceTempView("test_view")


query = f"""
SELECT *
FROM test_view
WHERE id = 1 AND ts BETWEEN 1751328004 AND 1751328104
"""
#display(spark.sql(query))
# Show logical/physical plan to inspect pushdown
spark.sql(query).explain(True)&lt;/LI-CODE&gt;&lt;P&gt;&lt;SPAN&gt;&lt;BR /&gt;Then pushed filters don't work:&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_3-1752673059756.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18231iEA278D81C4A95B99/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_3-1752673059756.png" alt="szymon_dybczak_3-1752673059756.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;But when you add ts attribute to partition by clause then optimizer will do its job:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_4-1752673160387.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18232iD0AFCCD4F3F87320/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_4-1752673160387.png" alt="szymon_dybczak_4-1752673160387.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_5-1752673195410.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18233i2FF96DB15696D107/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_5-1752673195410.png" alt="szymon_dybczak_5-1752673195410.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;And finally, what I recommended to try was to calculate at upstream table following attribute:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;to_timestamp(from_unixtime(FLOOR((ts - 1) / 60) * 60 + 60)) as ts_aligned&lt;/LI-CODE&gt;&lt;P&gt;&lt;SPAN&gt;And then create a view:&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;df = spark.sql("""
    SELECT
        ts_aligned,
        id,
        AVG(value) AS avg
    FROM
        raw_measurements
    GROUP BY
    id,
    ts_aligned
""")

df.createOrReplaceTempView("test_view")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Now, optimizer is again able to push filter to source:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_6-1752673811801.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/18234iF684B1D7C5CC39E9/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_6-1752673811801.png" alt="szymon_dybczak_6-1752673811801.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 16 Jul 2025 13:50:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dictionaryfilters-pushdown-on-views/m-p/125441#M47446</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-07-16T13:50:17Z</dc:date>
    </item>
    <item>
      <title>Re: DictionaryFilters Pushdown on Views</title>
      <link>https://community.databricks.com/t5/data-engineering/dictionaryfilters-pushdown-on-views/m-p/125462#M47451</link>
      <description>&lt;P&gt;Thank you for the detailed explanation!&amp;nbsp;&lt;/P&gt;&lt;P&gt;Our upstream table is very big (300.000+ signals and over 5 years of data). Adding a new column and with that rewrite the whole table... In the end we would like to have multiple views based on different time intervals to resample, eg. 60 sec, 15 min, 60 min, 1 day.&lt;BR /&gt;Than we would need to add all the aligned timestamps as columns in the upstream table. If we would want to add a new interval, this would mean to rewrite the whole table...&lt;/P&gt;&lt;P&gt;I could not come up with a way to get the avg by timespan without using some sort of function (like floor) on the ts column, so I guess to achieved our needs, we would need the ts_aligned columns in the table.&lt;/P&gt;&lt;P&gt;As the workaround is to just not create views but always query on the measurements table directly, we need to consider the cost of the rewriting of the measurement table with the new columns.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 16 Jul 2025 15:00:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dictionaryfilters-pushdown-on-views/m-p/125462#M47451</guid>
      <dc:creator>Steffen</dc:creator>
      <dc:date>2025-07-16T15:00:47Z</dc:date>
    </item>
  </channel>
</rss>

