01-07-2026 08:00 AM - edited 01-07-2026 08:01 AM
Hi community,
When using Delta Lake, data skipping relies on column statistics (min/max values). By default, we collect stats for:
The first 32 columns in the table (based on position) and 4 special columns.
This gives roughly 36 columns with stats.
However, in practice, queries often filter on other columns, especially in operational datasets. These columns may not be among the first 32 columns, which reduces the effectiveness of data skipping.
Current Approach
Start with first 32 columns from the table DDL.
Add special columns at the end.
From query metadata, identify columns used in filters.
Append to the existing stats columns list, ensuring no duplicates with the first 32 or special columns.
Challenge
How many additional filter columns should we include, given that hundreds of filter columns may exist in the table?
At what point does collecting stats become overhead?
Appreciate any insights!
01-07-2026 09:01 AM
@pooja_bhumandla You can increase the number of columns on which statistics are collected using the delta.dataSkippingNumIndexedCols table property.
Please refer to the following document for more information: https://docs.databricks.com/aws/en/delta/data-skipping
However, collecting stats on a large number of columns adds extra work at write time (the stats must be computed for every file) and increases transaction-log metadata size, which in turn affects query planning time and driver memory.
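For example, to raise the limit for a single table (the table name here is illustrative):

```sql
ALTER TABLE my_catalog.my_schema.my_table
SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '50');
```

Note that changing this property typically only affects files written afterwards; existing files keep whatever stats they were written with until they are rewritten or stats are recomputed.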
01-07-2026 09:32 PM
Thanks for the response.
In our case, we are explicitly configuring:
delta.dataSkippingStatsColumns = 'col1, col2, col3, ...' rather than increasing dataSkippingNumIndexedCols.
Is there a recommended or commonly observed upper bound on the number of columns specified in delta.dataSkippingStatsColumns, beyond which the overhead outweighs the benefit?
Any guidance or real-world limits would be helpful.
01-07-2026 09:19 PM
Hi @pooja_bhumandla - If your table is a managed table, it is better to enable predictive optimization; that way, Databricks automatically runs ANALYZE and collects the stats for you.
2 weeks ago
Hi @pooja_bhumandla,
This is a common scenario when working with wide operational tables, and there are a few approaches and guidelines to consider.
CONFIGURING WHICH COLUMNS GET STATISTICS
Rather than increasing the numeric limit with delta.dataSkippingNumIndexedCols, consider using the delta.dataSkippingStatsColumns table property (available on Databricks Runtime 13.3+). This lets you explicitly name the columns that should have statistics collected, rather than relying on schema position:
ALTER TABLE my_catalog.my_schema.my_table
SET TBLPROPERTIES(
'delta.dataSkippingStatsColumns' = 'col_a, col_b, col_c, col_d'
);
This property takes precedence over dataSkippingNumIndexedCols when set, so you get precise control over exactly which columns are tracked.
If you prefer the position-based approach, you can set dataSkippingNumIndexedCols to a higher value, or set it to -1 to collect stats on all columns (though this is rarely advisable for wide tables).
HOW MANY COLUMNS IS TOO MANY?
There is no single magic number, but here are practical guidelines:
1. Write overhead: Column statistics (min, max, null count) are computed at write time for every data file and stored in the Delta transaction log as JSON. Each additional stats column adds computation during writes and increases the size of the transaction log metadata. For write-heavy workloads, this overhead grows linearly with the number of stats columns.
2. Driver memory and query planning: The query planner reads these statistics at planning time. Significantly inflating the log metadata can increase planning time and driver memory pressure, especially on tables with many small files.
3. Practical recommendation: Focus on the columns that appear most frequently in WHERE/JOIN predicates across your most critical or expensive queries. A good strategy is to:
- Start with your top 10-20 most frequently filtered columns (from your query metadata analysis).
- Combine those with any columns used in Z-ORDER BY or as liquid clustering keys.
- Keep the total under roughly 50-60 stats columns unless you have benchmarked the write and planning overhead for your specific workload.
4. Diminishing returns: If a column has high cardinality with values spread across all files (not well-clustered), data skipping statistics on that column will not help much regardless. Stats are most effective on columns whose values are naturally or intentionally clustered within files.
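To ground the "top filter columns" selection, it can help to tally how often each candidate column actually appears in predicates. Below is a minimal Python sketch, assuming you have already exported query texts (for example from your query metadata) into a list of strings; the regex-based extraction and the candidate column list are simplifications for illustration, not a Databricks API:

```python
import re
from collections import Counter

def top_filter_columns(queries, candidate_cols, top_n=20):
    """Rank candidate columns by how often they appear after WHERE/ON keywords."""
    counts = Counter()
    for sql in queries:
        # Grab the text following a WHERE/ON keyword as a rough
        # approximation of the predicate portion of the query.
        for clause in re.findall(r"\b(?:WHERE|ON)\b(.*)", sql,
                                 flags=re.IGNORECASE | re.DOTALL):
            for col in candidate_cols:
                if re.search(rf"\b{re.escape(col)}\b", clause, flags=re.IGNORECASE):
                    counts[col] += 1
    return [col for col, _ in counts.most_common(top_n)]

# Toy query texts standing in for real query history.
queries = [
    "SELECT * FROM t WHERE device_id = 7 AND event_date > '2026-01-01'",
    "SELECT * FROM t WHERE device_id = 9",
    "SELECT * FROM t JOIN d ON t.region = d.region WHERE event_date = '2026-01-05'",
]
print(top_filter_columns(queries, ["device_id", "event_date", "region", "payload"]))
```

In practice you would feed this your real query history and the table's column list, then intersect the result with your clustering or Z-ORDER keys before setting delta.dataSkippingStatsColumns.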
COMPLEMENTARY STRATEGIES
Instead of collecting stats on hundreds of columns, consider these approaches that can be more effective:
1. Liquid clustering: If you are creating new tables or can migrate existing ones, liquid clustering (CLUSTER BY) is the recommended approach. It physically co-locates data by your chosen keys (up to 4), making data skipping extremely effective for those columns without manual OPTIMIZE/ZORDER. Clustering keys must have statistics collected.
CREATE TABLE my_table (...)
CLUSTER BY (filter_col_1, filter_col_2);
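If the table already exists and is unpartitioned, liquid clustering can typically be enabled in place as well (table and column names here are illustrative; verify support on your runtime version):

```sql
ALTER TABLE my_table
CLUSTER BY (filter_col_1, filter_col_2);
```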
2. Column reordering: Since the default 32-column limit is position-based, you can reorder your schema so that the most commonly filtered columns appear first. Combined with delta.dataSkippingStatsColumns, this gives you full control.
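As a sketch, a column can be moved to the front of the schema with an ALTER COLUMN statement; the names here are illustrative, and behavior (for example whether column mapping must be enabled) should be verified on your runtime:

```sql
ALTER TABLE my_table
ALTER COLUMN frequently_filtered_col FIRST;
```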
3. Predictive optimization: If you are using Unity Catalog managed tables on Premium tier, predictive optimization automatically handles OPTIMIZE and ANALYZE operations, including incremental statistics updates. This reduces the manual maintenance burden.
-- Predictive optimization is enabled at the catalog or schema level, for example:
ALTER SCHEMA my_catalog.my_schema ENABLE PREDICTIVE OPTIMIZATION;
4. Partition pruning for low-cardinality, always-filtered columns: For columns with a small number of distinct values that you always filter on (like region, date, or status), partitioning may still be effective alongside data skipping.
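A sketch of combining partitioning with explicit stats columns at table creation (the schema here is illustrative):

```sql
CREATE TABLE events (
  event_id BIGINT,
  region STRING,
  event_date DATE,
  payload STRING
)
PARTITIONED BY (region)
TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'event_id, event_date');
```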
SUMMARY
- Use delta.dataSkippingStatsColumns to explicitly target your most critical filter columns rather than inflating the numeric limit broadly.
- Keep the stats column count to a reasonable number (under 50-60 for most workloads) and benchmark your write latency if you go higher.
- Pair statistics collection with liquid clustering or Z-ORDER on your highest-priority filter columns for the best data skipping results.
- Not all columns benefit equally from stats: focus on those where values are clustered within files.
Documentation references:
https://docs.databricks.com/en/delta/data-skipping.html
https://docs.databricks.com/en/delta/table-properties.html
https://docs.databricks.com/en/delta/clustering.html
https://docs.databricks.com/en/optimizations/predictive-optimization.html
* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.