<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Liquid Clustering file pruning breaks when filtering on a high NULL numeric column in dataSkippi in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/liquid-clustering-file-pruning-breaks-when-filtering-on-a-high/m-p/156033#M54341</link>
    <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/120050"&gt;@aonurdemir&lt;/a&gt;&amp;nbsp;, I looked into your query and have compiled some helpful tips:&lt;/P&gt;
&lt;P&gt;I don't have direct access to your workspace internals, so I can't prove this definitively. But what you're seeing is consistent with how Delta's stats-based data skipping can get confused by a high-NULL numeric column.&lt;/P&gt;
&lt;P&gt;A few mechanics worth keeping in mind:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Data skipping is per-column, per-file. For each stats column, Delta keeps file-level min/max plus a few additional stats (like null count). When the planner sees a predicate on a column listed in &lt;CODE&gt;delta.dataSkippingStatsColumns&lt;/CODE&gt;, it uses those stats to decide which files might contain matching rows.&lt;/LI&gt;
&lt;LI&gt;When a stats column is very sparse (lots of NULLs), its stats are nearly non-discriminative. If every row in a file has &lt;CODE&gt;integer_pageView&lt;/CODE&gt; = NULL, that file's min and max are themselves NULL. Min/max-based reasoning alone can't prove such a file has no rows matching &lt;CODE&gt;integer_pageView &amp;gt; 0&lt;/CODE&gt; or &lt;CODE&gt;IS NOT NULL&lt;/CODE&gt;, so the engine conservatively keeps it. (In principle the per-file null count could exclude those files, which is part of what makes the observed behavior look like a fallback rather than intended skipping.)&lt;/LI&gt;
&lt;LI&gt;The subtle part is the interaction between columns. In your base query (without &lt;CODE&gt;integer_pageView&lt;/CODE&gt;), Delta combines &lt;CODE&gt;event_time&lt;/CODE&gt; and the other selective stats columns and prunes down to 303 files. The moment you reference &lt;CODE&gt;integer_pageView&lt;/CODE&gt; (a stats column with poor selectivity and many NULL-only files), the planner appears to fall back to a less effective skipping strategy and keeps 2,587 files. That's why the same pattern on &lt;CODE&gt;integer_earnPoints&lt;/CODE&gt; (not in stats) still returns 303 files. That predicate is applied row-level, after the good file pruning has already happened.&lt;/LI&gt;
&lt;/UL&gt;
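&lt;P&gt;To make that concrete, here is a toy model of the skipping decision (my own sketch, not Delta's internals; the file names and numbers are invented). With null counts, an all-NULL file is provably skippable for both predicates; if the decision degrades to min/max-only reasoning, NULL min/max prove nothing and the file is kept, which matches what you observe:&lt;/P&gt;

```python
# Toy model (NOT Delta's internals) of per-file stats and data skipping
# for a single stats column; file names and stats are invented.

def can_skip(stats, predicate):
    """True only when file-level stats PROVE no row can match."""
    all_null = stats["nullCount"] == stats["numRecords"]
    if predicate == "is_not_null":
        return all_null                      # all rows NULL: no matches possible
    if predicate == "gt_zero":
        if all_null:
            return True                      # nothing non-NULL to be > 0
        return stats["max"] is not None and not stats["max"] > 0
    return False                             # unknown predicate: keep the file

def can_skip_minmax_only(stats, predicate):
    """Same decision when only min/max are consulted (null count ignored)."""
    if predicate == "gt_zero":
        return stats["max"] is not None and not stats["max"] > 0
    return False                             # NULL min/max prove nothing

files = {
    "f_selective": {"min": 1, "max": 900, "nullCount": 0, "numRecords": 1000},
    "f_all_null":  {"min": None, "max": None, "nullCount": 1000, "numRecords": 1000},
}

# With null counts the all-NULL file is provably skippable...
assert can_skip(files["f_all_null"], "gt_zero")
# ...but min/max-only reasoning must keep it.
assert not can_skip_minmax_only(files["f_all_null"], "gt_zero")
```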
&lt;P&gt;I agree with your core conclusion:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;A high-NULL numeric column in &lt;CODE&gt;delta.dataSkippingStatsColumns&lt;/CODE&gt; can degrade overall file pruning, even when combined with a highly selective Liquid-clustered column like &lt;CODE&gt;event_time&lt;/CODE&gt;.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Whether we call that a bug or a limitation of the current skipping algorithm is mostly semantics. From a user's point of view, it's surprising behavior, and it deserves attention.&lt;/P&gt;
&lt;H3&gt;Why the workaround works&lt;/H3&gt;
&lt;P&gt;Your workaround:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class="language-sql"&gt;ALTER TABLE a_big_table
  SET TBLPROPERTIES (
    'delta.dataSkippingStatsColumns' = 'event_time,integer_userId,key_appName,...(minus integer_pageView)'
  );
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;forces &lt;CODE&gt;integer_pageView&lt;/CODE&gt; back into being a row-level filter only. So:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Delta uses good stats on &lt;CODE&gt;event_time&lt;/CODE&gt; and the other selective columns to prune down to roughly 303 files.&lt;/LI&gt;
&lt;LI&gt;&lt;CODE&gt;integer_pageView&lt;/CODE&gt; gets evaluated inside those files without disturbing the skipping plan.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;That's exactly what you want.&lt;/P&gt;
&lt;H3&gt;Alternative pattern (credit to @amirabedhiafi)&lt;/H3&gt;
&lt;P&gt;The generated-column approach @amirabedhiafi suggested is a clean compromise. One caveat: Delta generally only supports defining generated columns at table creation, so on an existing table you may need to recreate it (or add the derived column in your SDP pipeline) rather than rely on &lt;CODE&gt;ALTER TABLE ADD COLUMN&lt;/CODE&gt;. The shape of the definition:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class="language-sql"&gt;ALTER TABLE a_big_table
ADD COLUMN has_pageView BOOLEAN
  GENERATED ALWAYS AS (integer_pageView IS NOT NULL);

-- Or, if you only care about &amp;gt; 0:
ALTER TABLE a_big_table
ADD COLUMN pageView_positive BOOLEAN
  GENERATED ALWAYS AS (integer_pageView &amp;gt; 0);
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Then:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Add &lt;CODE&gt;has_pageView&lt;/CODE&gt; or &lt;CODE&gt;pageView_positive&lt;/CODE&gt; to &lt;CODE&gt;delta.dataSkippingStatsColumns&lt;/CODE&gt;.&lt;/LI&gt;
&lt;LI&gt;Filter on the helper column instead of the raw BIGINT:&lt;/LI&gt;
&lt;/UL&gt;
&lt;PRE&gt;&lt;CODE class="language-python"&gt;len(base.filter(col("pageView_positive") == True).inputFiles())
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;A helper like &lt;CODE&gt;has_pageView&lt;/CODE&gt; is a boolean that is never NULL (&lt;CODE&gt;IS NOT NULL&lt;/CODE&gt; always evaluates to true or false), so it doesn't suffer from the "all values NULL" pattern. Its file-level stats are much more selective and play nicely with Liquid Clustering and the other stats columns.&lt;/P&gt;
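&lt;P&gt;A quick sketch of why (again a toy model with invented file names and stats): because the helper has no NULLs, each file's boolean min/max cleanly separate "has page views" files from the rest:&lt;/P&gt;

```python
# Toy per-file stats for the boolean helper has_pageView (invented numbers).
# IS NOT NULL always evaluates to true/false, so the helper itself has no NULLs.

def can_skip_eq_true(stats):
    # A file whose max is False provably contains no True values.
    return stats["max"] is False

files = {
    "f_no_pageviews": {"min": False, "max": False},   # integer_pageView NULL throughout
    "f_mixed":        {"min": False, "max": True},
    "f_all_views":    {"min": True,  "max": True},
}

# Only files that might contain True survive pruning for has_pageView = true.
kept = sorted(name for name, s in files.items() if not can_skip_eq_true(s))
print(kept)  # the all-NULL-pageView file is skipped
```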
&lt;H3&gt;What to do next&lt;/H3&gt;
&lt;P&gt;You have a very clean reproduction and clear metrics (303 versus 2,587 files), so you're well positioned to get this in front of engineering. Do these in order:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Open a Databricks Support ticket (or go through your account team) and attach:
&lt;UL&gt;
&lt;LI&gt;The exact query shape you posted.&lt;/LI&gt;
&lt;LI&gt;Output of &lt;CODE&gt;DESCRIBE DETAIL a_big_table;&lt;/CODE&gt;&lt;/LI&gt;
&lt;LI&gt;Output of &lt;CODE&gt;SHOW TBLPROPERTIES a_big_table;&lt;/CODE&gt; (especially &lt;CODE&gt;delta.dataSkippingStatsColumns&lt;/CODE&gt;).&lt;/LI&gt;
&lt;LI&gt;Your clustering setup: the &lt;CODE&gt;clusteringColumns&lt;/CODE&gt; shown by &lt;CODE&gt;DESCRIBE DETAIL&lt;/CODE&gt; and when &lt;CODE&gt;OPTIMIZE FULL&lt;/CODE&gt; last ran (note &lt;CODE&gt;ZORDER BY&lt;/CODE&gt; doesn't apply here: ZORDER and Liquid Clustering are mutually exclusive).&lt;/LI&gt;
&lt;LI&gt;A comparison of &lt;CODE&gt;EXPLAIN&lt;/CODE&gt; plans with and without the &lt;CODE&gt;integer_pageView&lt;/CODE&gt; predicate. Look for changes in &lt;CODE&gt;DataSkippingFileIndex&lt;/CODE&gt; and the resulting file counts.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;In the meantime, keep using one of:
&lt;UL&gt;
&lt;LI&gt;Your current workaround (drop &lt;CODE&gt;integer_pageView&lt;/CODE&gt; from stats), or&lt;/LI&gt;
&lt;LI&gt;The generated-column pattern above.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;H3&gt;Takeaway&lt;/H3&gt;
&lt;P&gt;I don't know exactly which internal heuristic is causing the planner to downgrade pruning the moment &lt;CODE&gt;integer_pageView&lt;/CODE&gt; appears. Your hypothesis (that high-NULL numeric stats interfere with multi-column skipping) lines up with how a stats-driven index would behave: once a low-quality stats column gets pulled into the decision, it can drag down the overall pruning effectiveness.&lt;/P&gt;
&lt;P&gt;You've done a great job isolating this. It deserves to be escalated so engineering can either improve the behavior or document the limitation.&lt;/P&gt;
&lt;P&gt;Regards, Louis.&lt;/P&gt;</description>
    <pubDate>Mon, 04 May 2026 01:51:49 GMT</pubDate>
    <dc:creator>Louis_Frolio</dc:creator>
    <dc:date>2026-05-04T01:51:49Z</dc:date>
    <item>
      <title>Liquid Clustering file pruning breaks when filtering on a high NULL numeric column in dataSkipping</title>
      <link>https://community.databricks.com/t5/data-engineering/liquid-clustering-file-pruning-breaks-when-filtering-on-a-high/m-p/154915#M54155</link>
      <description>&lt;H3&gt;Environment&lt;/H3&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Cloud&lt;/STRONG&gt;: AWS&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Compute&lt;/STRONG&gt;: Serverless&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Table&lt;/STRONG&gt;:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;a_big_table&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Table type&lt;/STRONG&gt;: Streaming Table (SDP pipeline)&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Table size&lt;/STRONG&gt;: 641 GB, 6,210 files&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Liquid Clustering columns&lt;/STRONG&gt;:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;[event_time, integer_userId]&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;delta.dataSkippingStatsColumns&lt;/STRONG&gt;:&lt;BR /&gt;event_time, integer_userId, integer_pageView, key_appName, key_appType, key_ipCountry, key_userCountry, key_userCity, key_operatingSystem, key_operatingSystemVersion, key_userLanguage&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;OPTIMIZE FULL&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;was run (with clustering) before reproducing&lt;/LI&gt;&lt;/UL&gt;&lt;HR /&gt;&lt;H3&gt;Summary&lt;/H3&gt;&lt;P&gt;Adding&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;any&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;filter on&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;integer_pageView&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;(BIGINT, listed in&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;dataSkippingStatsColumns, ~41% NULL) causes file pruning to degrade from&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;303 → 2,587 files&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;— an 8.5x increase. This happens even with&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;IS NOT NULL.&lt;BR /&gt;The same filter pattern on an identical BIGINT column&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;not&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;in&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;dataSkippingStatsColumns&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;returns 303 files correctly.&lt;/P&gt;&lt;HR /&gt;&lt;H3&gt;Reproduction&lt;/H3&gt;&lt;PRE&gt;base = (
    spark.read.table("a_big_table")
    .filter(
        (col("event_time") &amp;gt;= lit("2026-04-01")) &amp;amp;
        (col("event_time") &amp;lt; lit("2026-04-04")) &amp;amp;
        (col("integer_userId").isNotNull()) &amp;amp;
        (col("key_appType").isin('ios', 'android', 'smart', 'web'))
    )
)

# All return 303 files (correct)
len(base.inputFiles())                                                  # 303
len(base.filter(col('key_operatingSystem') == 'android').inputFiles())   # 303
len(base.filter(col('key_ipCountry') == 'US').inputFiles())              # 303
len(base.filter(col('key_ipCountry').isNotNull()).inputFiles())          # 303
len(base.filter(col('integer_earnPoints') &amp;gt; 0).inputFiles())            # 303  ← BIGINT, NOT in stats
len(base.filter(col('integer_earnPoints').isNotNull()).inputFiles())     # 303  ← BIGINT, NOT in stats

# All return 2,587 files (BUG or something that I don't know)
len(base.filter(col('integer_pageView') &amp;gt; 0).inputFiles())              # 2587 ← BIGINT, IN stats
len(base.filter(col('integer_pageView').isNotNull()).inputFiles())       # 2587 ← BIGINT, IN stats&lt;/PRE&gt;&lt;HR /&gt;&lt;H3&gt;Key observations&lt;/H3&gt;&lt;TABLE&gt;&lt;THEAD&gt;&lt;TR&gt;&lt;TH&gt;Filter&lt;/TH&gt;&lt;TH&gt;Files read&lt;/TH&gt;&lt;TH&gt;Column in dataSkippingStatsColumns?&lt;/TH&gt;&lt;TH&gt;Column type&lt;/TH&gt;&lt;/TR&gt;&lt;/THEAD&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;Base (no extra filter)&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;303&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;—&lt;/TD&gt;&lt;TD&gt;—&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;integer_pageView &amp;gt; 0&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;2,587&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;YES&lt;/TD&gt;&lt;TD&gt;BIGINT&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;integer_pageView IS NOT NULL&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;2,587&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;YES&lt;/TD&gt;&lt;TD&gt;BIGINT&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;integer_earnPoints &amp;gt; 0&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;303&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;NO&lt;/TD&gt;&lt;TD&gt;BIGINT&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;integer_earnPoints IS NOT NULL&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;303&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;NO&lt;/TD&gt;&lt;TD&gt;BIGINT&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;key_operatingSystem = 'android'&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;303&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;YES&lt;/TD&gt;&lt;TD&gt;STRING&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;key_ipCountry IS NOT NULL&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;303&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;YES&lt;/TD&gt;&lt;TD&gt;STRING&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;key_ipCountry = 'US'&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;303&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;YES&lt;/TD&gt;&lt;TD&gt;STRING&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;HR /&gt;&lt;H3&gt;Column characteristics&lt;/H3&gt;&lt;UL&gt;&lt;LI&gt;integer_pageView: BIGINT,&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;41% NULL&lt;/STRONG&gt;, min=1 when non-null (never 0), at schema position 66&lt;/LI&gt;&lt;LI&gt;integer_earnPoints: BIGINT, similar NULL pattern, NOT 
in&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;dataSkippingStatsColumns&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;— works correctly&lt;/LI&gt;&lt;LI&gt;key_operatingSystem: STRING, 1.3% NULL, in&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;dataSkippingStatsColumns&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;— works correctly&lt;/LI&gt;&lt;LI&gt;All other stats columns are STRING type with low NULL rates — all work correctly&lt;/LI&gt;&lt;/UL&gt;&lt;HR /&gt;&lt;H3&gt;Expected behavior&lt;/H3&gt;&lt;P&gt;Adding&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;integer_pageView &amp;gt; 0&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;(or&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;IS NOT NULL) should return&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;≤ 303 files&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;since the filter is AND-combined with&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;event_time&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;range. Liquid Clustering domain pruning on&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;event_time&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;should remain effective regardless of any additional predicate.&lt;/P&gt;&lt;H3&gt;Actual behavior&lt;/H3&gt;&lt;P&gt;Any predicate referencing&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;integer_pageView&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;causes file pruning to degrade to&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;2,587 files&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;(~all OPTIMIZE'd files), suggesting that&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;Liquid Clustering domain pruning on&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;event_time&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;is bypassed&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;when Delta evaluates stats for this column.&lt;/P&gt;&lt;H3&gt;Hypothesis&lt;/H3&gt;&lt;P&gt;integer_pageView&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;is the only&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;non-clustering numeric column&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;with a&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;high NULL rate 
(41%)&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;in&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;dataSkippingStatsColumns. Many files have all-NULL values → NULL min/max in file stats. When Delta's data skipping evaluates any predicate against these NULL stats, it appears to corrupt the overall pruning pipeline, disabling Liquid Clustering domain pruning for the entire query.&lt;/P&gt;&lt;H3&gt;Workaround&lt;/H3&gt;&lt;P&gt;Remove&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;integer_pageView&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;from&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;delta.dataSkippingStatsColumns. The filter then becomes a post-scan row filter and domain pruning operates normally (303 files).&lt;/P&gt;</description>
      <pubDate>Mon, 20 Apr 2026 08:09:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/liquid-clustering-file-pruning-breaks-when-filtering-on-a-high/m-p/154915#M54155</guid>
      <dc:creator>aonurdemir</dc:creator>
      <dc:date>2026-04-20T08:09:32Z</dc:date>
    </item>
    <item>
      <title>Re: Liquid Clustering file pruning breaks when filtering on a high NULL numeric column in dataSkippi</title>
      <link>https://community.databricks.com/t5/data-engineering/liquid-clustering-file-pruning-breaks-when-filtering-on-a-high/m-p/155091#M54186</link>
      <description>&lt;P&gt;Hello !&lt;/P&gt;&lt;P&gt;Thank you for sharing the workaround.&lt;/P&gt;&lt;P&gt;I already had a similar issue and I resolved it&amp;nbsp;by adding&amp;nbsp;has_pageView BOOLEAN GENERATED ALWAYS AS (integer_pageView IS NOT NULL) or pageView_positive BOOLEAN GENERATED ALWAYS AS (integer_pageView &amp;gt; 0) and&amp;nbsp;included that helper in stats and queried on it.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 21 Apr 2026 13:46:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/liquid-clustering-file-pruning-breaks-when-filtering-on-a-high/m-p/155091#M54186</guid>
      <dc:creator>amirabedhiafi</dc:creator>
      <dc:date>2026-04-21T13:46:06Z</dc:date>
    </item>
    <item>
      <title>Re: Liquid Clustering file pruning breaks when filtering on a high NULL numeric column in dataSkippi</title>
      <link>https://community.databricks.com/t5/data-engineering/liquid-clustering-file-pruning-breaks-when-filtering-on-a-high/m-p/156033#M54341</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/120050"&gt;@aonurdemir&lt;/a&gt;&amp;nbsp;, I looked into your query and have compiled some helpful tips:&lt;/P&gt;
&lt;P&gt;I don't have direct access to your workspace internals, so I can't prove this definitively. But what you're seeing is consistent with how Delta's stats-based data skipping can get confused by a high-NULL numeric column.&lt;/P&gt;
&lt;P&gt;A few mechanics worth keeping in mind:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Data skipping is per-column, per-file. For each stats column, Delta keeps file-level min/max plus a few additional stats (like null count). When the planner sees a predicate on a column listed in &lt;CODE&gt;delta.dataSkippingStatsColumns&lt;/CODE&gt;, it uses those stats to decide which files might contain matching rows.&lt;/LI&gt;
&lt;LI&gt;When a stats column is very sparse (lots of NULLs), its stats are nearly non-discriminative. If every row in a file has &lt;CODE&gt;integer_pageView&lt;/CODE&gt; = NULL, that file's min and max are themselves NULL. Min/max-based reasoning alone can't prove such a file has no rows matching &lt;CODE&gt;integer_pageView &amp;gt; 0&lt;/CODE&gt; or &lt;CODE&gt;IS NOT NULL&lt;/CODE&gt;, so the engine conservatively keeps it. (In principle the per-file null count could exclude those files, which is part of what makes the observed behavior look like a fallback rather than intended skipping.)&lt;/LI&gt;
&lt;LI&gt;The subtle part is the interaction between columns. In your base query (without &lt;CODE&gt;integer_pageView&lt;/CODE&gt;), Delta combines &lt;CODE&gt;event_time&lt;/CODE&gt; and the other selective stats columns and prunes down to 303 files. The moment you reference &lt;CODE&gt;integer_pageView&lt;/CODE&gt; (a stats column with poor selectivity and many NULL-only files), the planner appears to fall back to a less effective skipping strategy and keeps 2,587 files. That's why the same pattern on &lt;CODE&gt;integer_earnPoints&lt;/CODE&gt; (not in stats) still returns 303 files. That predicate is applied row-level, after the good file pruning has already happened.&lt;/LI&gt;
&lt;/UL&gt;
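&lt;P&gt;To make that concrete, here is a toy model of the skipping decision (my own sketch, not Delta's internals; the file names and numbers are invented). With null counts, an all-NULL file is provably skippable for both predicates; if the decision degrades to min/max-only reasoning, NULL min/max prove nothing and the file is kept, which matches what you observe:&lt;/P&gt;

```python
# Toy model (NOT Delta's internals) of per-file stats and data skipping
# for a single stats column; file names and stats are invented.

def can_skip(stats, predicate):
    """True only when file-level stats PROVE no row can match."""
    all_null = stats["nullCount"] == stats["numRecords"]
    if predicate == "is_not_null":
        return all_null                      # all rows NULL: no matches possible
    if predicate == "gt_zero":
        if all_null:
            return True                      # nothing non-NULL to be > 0
        return stats["max"] is not None and not stats["max"] > 0
    return False                             # unknown predicate: keep the file

def can_skip_minmax_only(stats, predicate):
    """Same decision when only min/max are consulted (null count ignored)."""
    if predicate == "gt_zero":
        return stats["max"] is not None and not stats["max"] > 0
    return False                             # NULL min/max prove nothing

files = {
    "f_selective": {"min": 1, "max": 900, "nullCount": 0, "numRecords": 1000},
    "f_all_null":  {"min": None, "max": None, "nullCount": 1000, "numRecords": 1000},
}

# With null counts the all-NULL file is provably skippable...
assert can_skip(files["f_all_null"], "gt_zero")
# ...but min/max-only reasoning must keep it.
assert not can_skip_minmax_only(files["f_all_null"], "gt_zero")
```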
&lt;P&gt;I agree with your core conclusion:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;A high-NULL numeric column in &lt;CODE&gt;delta.dataSkippingStatsColumns&lt;/CODE&gt; can degrade overall file pruning, even when combined with a highly selective Liquid-clustered column like &lt;CODE&gt;event_time&lt;/CODE&gt;.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Whether we call that a bug or a limitation of the current skipping algorithm is mostly semantics. From a user's point of view, it's surprising behavior, and it deserves attention.&lt;/P&gt;
&lt;H3&gt;Why the workaround works&lt;/H3&gt;
&lt;P&gt;Your workaround:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class="language-sql"&gt;ALTER TABLE a_big_table
  SET TBLPROPERTIES (
    'delta.dataSkippingStatsColumns' = 'event_time,integer_userId,key_appName,...(minus integer_pageView)'
  );
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;forces &lt;CODE&gt;integer_pageView&lt;/CODE&gt; back into being a row-level filter only. So:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Delta uses good stats on &lt;CODE&gt;event_time&lt;/CODE&gt; and the other selective columns to prune down to roughly 303 files.&lt;/LI&gt;
&lt;LI&gt;&lt;CODE&gt;integer_pageView&lt;/CODE&gt; gets evaluated inside those files without disturbing the skipping plan.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;That's exactly what you want.&lt;/P&gt;
&lt;H3&gt;Alternative pattern (credit to @amirabedhiafi)&lt;/H3&gt;
&lt;P&gt;The generated-column approach @amirabedhiafi suggested is a clean compromise. One caveat: Delta generally only supports defining generated columns at table creation, so on an existing table you may need to recreate it (or add the derived column in your SDP pipeline) rather than rely on &lt;CODE&gt;ALTER TABLE ADD COLUMN&lt;/CODE&gt;. The shape of the definition:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class="language-sql"&gt;ALTER TABLE a_big_table
ADD COLUMN has_pageView BOOLEAN
  GENERATED ALWAYS AS (integer_pageView IS NOT NULL);

-- Or, if you only care about &amp;gt; 0:
ALTER TABLE a_big_table
ADD COLUMN pageView_positive BOOLEAN
  GENERATED ALWAYS AS (integer_pageView &amp;gt; 0);
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Then:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Add &lt;CODE&gt;has_pageView&lt;/CODE&gt; or &lt;CODE&gt;pageView_positive&lt;/CODE&gt; to &lt;CODE&gt;delta.dataSkippingStatsColumns&lt;/CODE&gt;.&lt;/LI&gt;
&lt;LI&gt;Filter on the helper column instead of the raw BIGINT:&lt;/LI&gt;
&lt;/UL&gt;
&lt;PRE&gt;&lt;CODE class="language-python"&gt;len(base.filter(col("pageView_positive") == True).inputFiles())
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;A helper like &lt;CODE&gt;has_pageView&lt;/CODE&gt; is a boolean that is never NULL (&lt;CODE&gt;IS NOT NULL&lt;/CODE&gt; always evaluates to true or false), so it doesn't suffer from the "all values NULL" pattern. Its file-level stats are much more selective and play nicely with Liquid Clustering and the other stats columns.&lt;/P&gt;
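&lt;P&gt;A quick sketch of why (again a toy model with invented file names and stats): because the helper has no NULLs, each file's boolean min/max cleanly separate "has page views" files from the rest:&lt;/P&gt;

```python
# Toy per-file stats for the boolean helper has_pageView (invented numbers).
# IS NOT NULL always evaluates to true/false, so the helper itself has no NULLs.

def can_skip_eq_true(stats):
    # A file whose max is False provably contains no True values.
    return stats["max"] is False

files = {
    "f_no_pageviews": {"min": False, "max": False},   # integer_pageView NULL throughout
    "f_mixed":        {"min": False, "max": True},
    "f_all_views":    {"min": True,  "max": True},
}

# Only files that might contain True survive pruning for has_pageView = true.
kept = sorted(name for name, s in files.items() if not can_skip_eq_true(s))
print(kept)  # the all-NULL-pageView file is skipped
```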
&lt;H3&gt;What to do next&lt;/H3&gt;
&lt;P&gt;You have a very clean reproduction and clear metrics (303 versus 2,587 files), so you're well positioned to get this in front of engineering. Do these in order:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Open a Databricks Support ticket (or go through your account team) and attach:
&lt;UL&gt;
&lt;LI&gt;The exact query shape you posted.&lt;/LI&gt;
&lt;LI&gt;Output of &lt;CODE&gt;DESCRIBE DETAIL a_big_table;&lt;/CODE&gt;&lt;/LI&gt;
&lt;LI&gt;Output of &lt;CODE&gt;SHOW TBLPROPERTIES a_big_table;&lt;/CODE&gt; (especially &lt;CODE&gt;delta.dataSkippingStatsColumns&lt;/CODE&gt;).&lt;/LI&gt;
&lt;LI&gt;Your clustering setup: the &lt;CODE&gt;clusteringColumns&lt;/CODE&gt; shown by &lt;CODE&gt;DESCRIBE DETAIL&lt;/CODE&gt; and when &lt;CODE&gt;OPTIMIZE FULL&lt;/CODE&gt; last ran (note &lt;CODE&gt;ZORDER BY&lt;/CODE&gt; doesn't apply here: ZORDER and Liquid Clustering are mutually exclusive).&lt;/LI&gt;
&lt;LI&gt;A comparison of &lt;CODE&gt;EXPLAIN&lt;/CODE&gt; plans with and without the &lt;CODE&gt;integer_pageView&lt;/CODE&gt; predicate. Look for changes in &lt;CODE&gt;DataSkippingFileIndex&lt;/CODE&gt; and the resulting file counts.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;In the meantime, keep using one of:
&lt;UL&gt;
&lt;LI&gt;Your current workaround (drop &lt;CODE&gt;integer_pageView&lt;/CODE&gt; from stats), or&lt;/LI&gt;
&lt;LI&gt;The generated-column pattern above.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/OL&gt;
&lt;H3&gt;Takeaway&lt;/H3&gt;
&lt;P&gt;I don't know exactly which internal heuristic is causing the planner to downgrade pruning the moment &lt;CODE&gt;integer_pageView&lt;/CODE&gt; appears. Your hypothesis (that high-NULL numeric stats interfere with multi-column skipping) lines up with how a stats-driven index would behave: once a low-quality stats column gets pulled into the decision, it can drag down the overall pruning effectiveness.&lt;/P&gt;
&lt;P&gt;You've done a great job isolating this. It deserves to be escalated so engineering can either improve the behavior or document the limitation.&lt;/P&gt;
&lt;P&gt;Regards, Louis.&lt;/P&gt;</description>
      <pubDate>Mon, 04 May 2026 01:51:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/liquid-clustering-file-pruning-breaks-when-filtering-on-a-high/m-p/156033#M54341</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2026-05-04T01:51:49Z</dc:date>
    </item>
  </channel>
</rss>

