Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Collecting Delta Stats for Columns Used in Filters Beyond Default First 32 Columns

pooja_bhumandla
New Contributor III

Hi community,

When using Delta Lake, data skipping relies on column statistics (min/max values). By default, stats are collected for the first 32 columns in the table (based on position) plus 4 special columns, giving roughly 36 columns with stats.
However, in practice, queries often filter on other columns, especially in operational datasets. These columns may not be among the first 32, which reduces the effectiveness of data skipping.

Current Approach
Start with first 32 columns from the table DDL.
Add special columns at the end.
From query metadata, identify columns used in filters.
Append these to the existing stats columns list, ensuring no duplicates with the first 32 or special columns (a sketch of this flow follows below).
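
As a minimal sketch of this approach in PySpark (the table name, special columns, and filter columns below are hypothetical placeholders, and mining the filter columns from query metadata is assumed to have happened already):

# Sketch: build a deduplicated stats-column list and apply it via the
# delta.dataSkippingStatsColumns table property.
# `spark` is the SparkSession predefined in Databricks notebooks.
table_name = "events"  # hypothetical table

# First 32 columns by position, as in the table DDL.
base_cols = spark.table(table_name).columns[:32]

# Placeholders: special columns and filter columns mined from query metadata.
special_cols = ["event_date", "region", "tenant_id", "status"]
filter_cols = ["order_id", "customer_id"]

# Deduplicate while preserving order.
stats_cols = list(dict.fromkeys(base_cols + special_cols + filter_cols))

spark.sql(
    f"ALTER TABLE {table_name} SET TBLPROPERTIES "
    f"('delta.dataSkippingStatsColumns' = '{','.join(stats_cols)}')"
)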

Challenge
How many additional filter columns should be included, given that hundreds of filter columns may exist in the table?
At what point does collecting stats become overhead?

Appreciate any insights!

3 REPLIES

JAHNAVI
Databricks Employee

@pooja_bhumandla We can increase the number of columns on which statistics are collected using the table property delta.dataSkippingNumIndexedCols.

Please refer to the following document for additional information on this: https://docs.databricks.com/aws/en/delta/data-skipping

However, adding a large number of columns to stats collection adds extra complexity at write time for computing the stats and increases transaction-log metadata size, which affects planning time and driver memory.
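
For illustration, a minimal sketch of raising that limit (the table name and value are placeholders; the property governs stats for newly written files, and the recompute step below is an assumption about recent Databricks runtimes):

# Sketch: raise the positional stats limit on a hypothetical table.
# `spark` is the SparkSession predefined in Databricks notebooks.
spark.sql(
    "ALTER TABLE events "
    "SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '64')"
)

# Assumption: recent Databricks runtimes can recompute stats for files
# written before the change; otherwise only new files pick up the setting.
spark.sql("ANALYZE TABLE events COMPUTE DELTA STATISTICS")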

Jahnavi N

@JAHNAVI 

Thanks for the response.

In our case, we are explicitly configuring:
delta.dataSkippingStatsColumns = 'col1, col2, col3, ...' rather than increasing dataSkippingNumIndexedCols.
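
As a quick sanity check (sketch; the table name is a placeholder), the current setting can be inspected with SHOW TBLPROPERTIES:

# Sketch: verify which stats columns are currently configured.
# `spark` is the SparkSession predefined in Databricks notebooks.
spark.sql(
    "SHOW TBLPROPERTIES events ('delta.dataSkippingStatsColumns')"
).show(truncate=False)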


Is there a recommended or commonly observed upper bound on the number of columns specified in delta.dataSkippingStatsColumns, beyond which the overhead outweighs the benefit?


Any guidance or real-world limits would be helpful.

Sanjeeb2024
Contributor III

Hi @pooja_bhumandla - If your table is a managed table, it is better to enable predictive optimization; that way Databricks automatically runs analyze and collects the stats.
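
A minimal sketch of enabling it (assuming a Unity Catalog managed table in a hypothetical schema main.sales; predictive optimization is enabled at the catalog or schema level rather than per table):

# Sketch: enable predictive optimization for a hypothetical schema.
# `spark` is the SparkSession predefined in Databricks notebooks.
spark.sql("ALTER SCHEMA main.sales ENABLE PREDICTIVE OPTIMIZATION")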


Sanjeeb Mohapatra
