Liquid Clustering file pruning breaks when filtering on a high NULL numeric column in dataSkipping

aonurdemir
Contributor

Environment

  • Cloud: AWS
  • Compute: Serverless
  • Table: a_big_table
  • Table type: Streaming Table (SDP pipeline)
  • Table size: 641 GB, 6,210 files
  • Liquid Clustering columns: [event_time, integer_userId]
  • delta.dataSkippingStatsColumns:
    event_time, integer_userId, integer_pageView, key_appName, key_appType, key_ipCountry, key_userCountry, key_userCity, key_operatingSystem, key_operatingSystemVersion, key_userLanguage
  • OPTIMIZE FULL was run (with clustering) before reproducing

Summary

Adding any filter on integer_pageView (BIGINT, listed in dataSkippingStatsColumns, ~41% NULL) causes file pruning to degrade from 303 → 2,587 files — an 8.5x increase. This happens even with IS NOT NULL.
The same filter pattern on an identical BIGINT column not in dataSkippingStatsColumns returns 303 files correctly.


Reproduction

base = (
    spark.read.table("a_big_table")
    .filter(
        (col("event_time") >= lit("2026-04-01")) &
        (col("event_time") < lit("2026-04-04")) &
        (col("integer_userId").isNotNull()) &
        (col("key_appType").isin('ios', 'android', 'smart', 'web'))
    )
)

# All return 303 files (correct)
len(base.inputFiles())                                                  # 303
len(base.filter(col('key_operatingSystem') == 'android').inputFiles())   # 303
len(base.filter(col('key_ipCountry') == 'US').inputFiles())              # 303
len(base.filter(col('key_ipCountry').isNotNull()).inputFiles())          # 303
len(base.filter(col('integer_earnPoints') > 0).inputFiles())            # 303  ← BIGINT, NOT in stats
len(base.filter(col('integer_earnPoints').isNotNull()).inputFiles())     # 303  ← BIGINT, NOT in stats

# All return 2,587 files (BUG or something that I don't know)
len(base.filter(col('integer_pageView') > 0).inputFiles())              # 2587 ← BIGINT, IN stats
len(base.filter(col('integer_pageView').isNotNull()).inputFiles())       # 2587 ← BIGINT, IN stats

Key observations

Filter Files read Column in dataSkippingStatsColumns? Column type

Base (no extra filter)303
integer_pageView > 02,587YESBIGINT
integer_pageView IS NOT NULL2,587YESBIGINT
integer_earnPoints > 0303NOBIGINT
integer_earnPoints IS NOT NULL303NOBIGINT
key_operatingSystem = 'android'303YESSTRING
key_ipCountry IS NOT NULL303YESSTRING
key_ipCountry = 'US'303YESSTRING

Column characteristics

  • integer_pageView: BIGINT, 41% NULL, min=1 when non-null (never 0), at schema position 66
  • integer_earnPoints: BIGINT, similar NULL pattern, NOT in dataSkippingStatsColumns — works correctly
  • key_operatingSystem: STRING, 1.3% NULL, in dataSkippingStatsColumns — works correctly
  • All other stats columns are STRING type with low NULL rates — all work correctly

Expected behavior

Adding integer_pageView > 0 (or IS NOT NULL) should return ≤ 303 files since the filter is AND-combined with event_time range. Liquid Clustering domain pruning on event_time should remain effective regardless of any additional predicate.

Actual behavior

Any predicate referencing integer_pageView causes file pruning to degrade to 2,587 files (~all OPTIMIZE'd files), suggesting that Liquid Clustering domain pruning on event_time is bypassed when Delta evaluates stats for this column.

Hypothesis

integer_pageView is the only non-clustering numeric column with a high NULL rate (41%) in dataSkippingStatsColumns. Many files have all-NULL values → NULL min/max in file stats. When Delta's data skipping evaluates any predicate against these NULL stats, it appears to corrupt the overall pruning pipeline, disabling Liquid Clustering domain pruning for the entire query.

Workaround

Remove integer_pageView from delta.dataSkippingStatsColumns. The filter then becomes a post-scan row filter and domain pruning operates normally (303 files).