Data Skipping- Partitioned tables

Sainath368 — Fri, 13 Jun 2025 15:28:57 GMT

Hi all,

I have a question: how can we modify delta.dataSkippingStatsColumns and compute statistics for a partitioned Delta table in Databricks? I want to understand the process and best practices for changing this setting and making sure statistics are computed accurately for partitioned data. Any guidance would be appreciated.

Re: Data Skipping- Partitioned tables

paolajara — Wed, 18 Jun 2025 16:16:07 GMT

Hi, delta.dataSkippingStatsColumns specifies a comma-separated list of column names for which Delta Lake collects statistics. It supersedes the default behavior of collecting statistics on the first 32 columns of the table, so it can improve data skipping by limiting statistics collection to the columns you actually filter on.

Set the property when creating or modifying a table:
ALTER TABLE your_table_name
  SET TBLPROPERTIES('delta.dataSkippingStatsColumns' = 'col1, col2, col3');
Changing the property does not recompute statistics for existing data; it only affects data written afterwards. In Databricks Runtime (DBR) 14.3 LTS and above, you can manually recompute statistics for the existing data:
ANALYZE TABLE your_table_name COMPUTE DELTA STATISTICS;
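For the "when creating" case, a minimal sketch might look like the following. The table name sales_events and its columns are hypothetical, chosen only for illustration, and the partition column is left out of the statistics list on the assumption that partition pruning already covers it.

```sql
-- Hypothetical partitioned table; all names are illustrative only.
CREATE TABLE sales_events (
  event_date  DATE,
  customer_id BIGINT,
  amount      DECIMAL(18, 2),
  payload     STRING
)
USING DELTA
PARTITIONED BY (event_date)
TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'customer_id,amount');

-- Check which columns the table is configured to collect statistics on.
SHOW TBLPROPERTIES sales_events ('delta.dataSkippingStatsColumns');
```

Checking the property with SHOW TBLPROPERTIES before and after a change is a cheap way to confirm the setting took effect.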
Avoid long string columns, since they are expensive to analyze. These should be excluded from delta.dataSkippingStatsColumns.
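Ideally, the columns specified for statistics should overlap with your filtering criteria, for example high-cardinality columns that appear in WHERE clauses. Partition columns are already pruned through their partition values rather than file-level statistics, so they generally do not need to be listed. Statistics on unused or rarely filtered columns waste compute resources without significant benefit.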
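As a rough sketch of that column choice, assume a hypothetical Delta table sales_events partitioned by event_date, with columns customer_id, amount, and a long string column payload. None of these names come from the original question; they are only there to make the example concrete.

```sql
-- Assumes a hypothetical Delta table sales_events partitioned by event_date,
-- with columns customer_id, amount, and a long string column payload.

-- Collect statistics only on columns used in filters; payload is left out.
ALTER TABLE sales_events
  SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'customer_id,amount');

-- Recompute statistics for files written before the change (DBR 14.3 LTS and above).
ANALYZE TABLE sales_events COMPUTE DELTA STATISTICS;

-- A query shaped like this can rely on partition pruning for event_date
-- and on file-level statistics for customer_id.
SELECT sum(amount)
FROM sales_events
WHERE event_date = DATE'2025-06-01'
  AND customer_id = 42;
```

The idea is that the partition filter narrows the set of directories first, and the min/max statistics on customer_id then let the engine skip files within the remaining partitions.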
