cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Best Practice for Updating Data Skipping Statistics for Additional Columns

pooja_bhumandla
New Contributor III

Hi Community,
I have a scenario where I’ve already calculated delta statistics for the first 32 columns after enabling the dataskipping property. Now, I need to include 10 more frequently used columns that were not part of the original 32.

Goal:
I want a robust way to calculate statistics for the new 10 columns along with the existing 32 columns without unnecessary recomputation or fragmented maintenance.

Here’s the challenge:
1. If I alter the dataskipping property for all 42 columns (32 existing + 10 new) and recompute statistics for all tables, it introduces unnecessary recomputation.

What’s the recommended approach or best practice here?

Appreciate any insights!

2 REPLIES 2

szymon_dybczak
Esteemed Contributor III

Hi @pooja_bhumandla ,

Updating any of two below options does not automatically recompute statistics for existing data. Rather, it impacts the behavior of future statistics collection when adding or updating data in the table.

- delta.dataSkippingNumIndexedCols

- delta.dataSkippingStatsColumns

In Databricks Runtime 14.3 LTS and above, if you have altered the table properties or changed the specified columns for statistics, you can manually trigger the recomputation of statistics for a Delta table using the following command:

ANALYZE TABLE table_name COMPUTE DELTA STATISTICS

But you don't have to recompute statistics for all columns. This command supports listing the columns for which you'd like to refresh stats.

szymon_dybczak_0-1763129695038.png

So, either delta.dataSkippingNumIndexedCols to override 32 col limitation and then run analyze table manually or use delta.dataSkippingStatsColumns, list all of required columns and run analyze table manually.

Hi @szymon_dybczak, thank you for your response. 

If I update the table property to include all 42 columns (new 10 + old 32 cols) in delta.dataSkippingStatsColumns,

pooja_bhumandla_0-1763366089648.png

and then run ANALYZE TABLE only for the new 10 columns (col33–col42):

pooja_bhumandla_1-1763366216377.png

My Questions:

  • Will this keep the existing statistics for columns 1–32 unchanged and only compute statistics for columns 33–42?
  • Are the statistics for both the old (1–32) and new (33–42) columns maintained together in the same centralized metadata location, or are they managed separately?
  • Does this approach ensure that all 42 columns will have statistics available for data skipping without any redundant recomputation? (i.e., existing stats stay as-is, and only new stats are added)
  • After setting the dataSkipping property for all 42 columns, will future data loads automatically generate and maintain statistics for all 42 columns going forward?

Appreciate any insights and clarifications!

 

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!

Sign Up Now