Hello @aonurdemir , I looked into your query and have compiled some helpful tips:
I don't have direct access to your workspace internals, so I can't prove this definitively. But what you're seeing is consistent with how Delta's stats-based data skipping can get confused by a high-NULL numeric column.
A few mechanics worth keeping in mind:
- Data skipping is per-column, per-file. For each stats column, Delta keeps file-level min/max plus a few additional stats (like null count). When the planner sees a predicate on a column listed in
delta.dataSkippingStatsColumns, it uses those stats to decide which files might contain matching rows.
- When a stats column is very sparse (lots of NULLs), its stats are nearly non-discriminative. If many files have
integer_pageView = NULL for every row, their file-level stats look like "all NULLs." For predicates like integer_pageView > 0 or IS NOT NULL, the engine can't safely prove those files have no matches, so it keeps them.
- The subtle part is the interaction between columns. In your base query (without
integer_pageView), Delta combines event_time and the other selective stats columns and prunes down to 303 files. The moment you reference integer_pageView (a stats column with poor selectivity and many NULL-only files), the planner appears to fall back to a less effective skipping strategy and keeps 2,587 files. That's why the same pattern on integer_earnPoints (not in stats) still returns 303 files. That predicate is applied row-level, after the good file pruning has already happened.
I agree with your core conclusion:
high-NULL numeric column + being in delta.dataSkippingStatsColumns can degrade overall file pruning, even when combined with a highly selective Liquid-clustered column like event_time.
Whether we call that a bug or a limitation of the current skipping algorithm is mostly semantics. From a user's point of view, it's surprising behavior, and it deserves attention.
Why the workaround works
Your workaround:
ALTER TABLE a_big_table
SET TBLPROPERTIES (
'delta.dataSkippingStatsColumns' = 'event_time,integer_userId,key_appName,...(minus integer_pageView)'
);
forces integer_pageView back into being a row-level filter only. So:
- Delta uses good stats on
event_time and the other selective columns to prune down to roughly 303 files.
integer_pageView gets evaluated inside those files without disturbing the skipping plan.
That's exactly what you want.
Alternative pattern (credit to @amirabedhiafi)
The generated-column approach @amirabedhiafi suggested is a clean compromise:
ALTER TABLE a_big_table
ADD COLUMN has_pageView BOOLEAN
GENERATED ALWAYS AS (integer_pageView IS NOT NULL);
-- Or, if you only care about > 0:
ALTER TABLE a_big_table
ADD COLUMN pageView_positive BOOLEAN
GENERATED ALWAYS AS (integer_pageView > 0);
Then:
- Add
has_pageView or pageView_positive to delta.dataSkippingStatsColumns.
- Filter on the helper column instead of the raw BIGINT:
len(base.filter(col("pageView_positive") == True).inputFiles())
The helper column is boolean and doesn't suffer from the same "all values NULL" pattern. Its file-level stats are much more selective and play nicely with Liquid clustering and the other stats columns.
What to do next
You have a very clean reproduction and clear metrics (303 versus 2,587 files), so you're well positioned to get this in front of engineering. Do these in order:
- Open a Databricks Support ticket (or go through your account team) and attach:
- The exact query shape you posted.
- Output of
DESCRIBE DETAIL a_big_table;
- Output of
SHOW TBLPROPERTIES a_big_table; (especially delta.dataSkippingStatsColumns).
- Output of
OPTIMIZE a_big_table ZORDER BY (event_time, integer_userId) or your Liquid Clustering config.
- A comparison of
EXPLAIN plans with and without the integer_pageView predicate. Look for changes in DataSkippingFileIndex and the resulting file counts.
- In the meantime, keep using one of:
- Your current workaround (drop
integer_pageView from stats), or
- The generated-column pattern above.
Takeaway
I don't know exactly which internal heuristic is causing the planner to downgrade pruning the moment integer_pageView appears. Your hypothesis (that high-NULL numeric stats interfere with multi-column skipping) lines up with how a stats-driven index would behave: once a low-quality stats column gets pulled into the decision, it can drag down the overall pruning effectiveness.
You've done a great job isolating this. It deserves to be escalated so engineering can either improve the behavior or document the limitation.
Regards, Louis.