Hi community,
When using Delta Lake, data skipping relies on per-column statistics (min/max values). By default, we collect stats for:
the first 32 columns in the table (based on position), plus 4 special columns.
That gives us roughly 36 columns with stats.
However, in practice, queries often filter on other columns, especially in operational datasets. Because those columns fall outside the first 32, data skipping is far less effective for them.
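For context, here is a minimal sketch of the two table properties Delta exposes for controlling which columns get stats: the positional cutoff and an explicit column list. The table name "events" and the column names are placeholders, and the explicit-list property may require a recent Delta Lake version, so treat this as illustrative only.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Option A: raise the positional cutoff (default is 32 columns)
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '48')
""")

# Option B: list the exact columns to collect stats on, regardless of position
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'event_time,customer_id,status')
""")
```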
Current Approach
Start with the first 32 columns from the table DDL.
Add special columns at the end.
From query metadata, identify columns used in filters.
Append these to the existing stats-columns list, ensuring no duplicates with the first 32 or the special columns (rough sketch below).
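This is roughly what that looks like in code, assuming the filter columns have already been extracted from query metadata. The table name, special columns, and filter_columns list below are hypothetical placeholders, not anything Delta provides.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
table_name = "events"  # placeholder

# First 32 columns by position, taken from the table schema
schema_columns = [f.name for f in spark.table(table_name).schema.fields]
stats_columns = schema_columns[:32]

# Special columns appended at the end (placeholder names)
special_columns = ["_ingest_ts", "_source_file"]
# Filter columns identified from query metadata (placeholder names)
filter_columns = ["customer_id", "order_status"]

# Append extras, skipping anything already covered
for col in special_columns + filter_columns:
    if col not in stats_columns:
        stats_columns.append(col)

# Persist the final list as the table's stats columns
spark.sql(f"""
    ALTER TABLE {table_name}
    SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = '{",".join(stats_columns)}')
""")
```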
Challenge
How many additional (filter) columns should we include, given that hundreds of filter columns may exist in the table?
At what point does collecting stats on extra columns add more write-time overhead than it saves at query time?
Appreciate any insights!