Data Engineering
Best Practice for Updating Data Skipping Statistics for Additional Columns

pooja_bhumandla
New Contributor III

Hi Community,
I have a scenario where I’ve already calculated Delta statistics for the first 32 columns with the data skipping property enabled. Now I need to include 10 more frequently used columns that were not part of the original 32.

Goal:
I want a robust way to calculate statistics for the new 10 columns along with the existing 32 columns without unnecessary recomputation or fragmented maintenance.

Here’s the challenge:
If I alter the data skipping property to cover all 42 columns (32 existing + 10 new) and recompute statistics for the whole table, it triggers unnecessary recomputation for the 32 columns that already have statistics.

What’s the recommended approach or best practice here?

Appreciate any insights!

3 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi @pooja_bhumandla ,

Updating either of the two properties below does not automatically recompute statistics for existing data. Instead, it affects how statistics are collected going forward, when data is added to or updated in the table.

- delta.dataSkippingNumIndexedCols

- delta.dataSkippingStatsColumns
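
For illustration, setting either property could look roughly like this (the table name, the column list, and the value 42 are placeholders for your own):

   -- Option 1: raise the number of leading columns that get file-skipping stats
   ALTER TABLE your_catalog.your_schema.your_table
   SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '42');

   -- Option 2: name the exact columns to collect file-skipping stats for
   ALTER TABLE your_catalog.your_schema.your_table
   SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'col1,col2,col33,col42');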

In Databricks Runtime 14.3 LTS and above, if you have altered the table properties or changed the specified columns for statistics, you can manually trigger the recomputation of statistics for a Delta table using the following command:

ANALYZE TABLE table_name COMPUTE DELTA STATISTICS

But you don't have to recompute statistics for all columns. This command supports listing the columns for which you'd like to refresh stats.

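For reference, the column-scoped form looks roughly like this (table and column names are placeholders):

   ANALYZE TABLE your_catalog.your_schema.your_table
   COMPUTE DELTA STATISTICS
   FOR COLUMNS col_a, col_b;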

So, either use delta.dataSkippingNumIndexedCols to override the 32-column limitation and then run ANALYZE TABLE manually, or use delta.dataSkippingStatsColumns to list all of the required columns and then run ANALYZE TABLE manually.

Hi @szymon_dybczak, thank you for your response. 

If I update the table property to include all 42 columns (new 10 + old 32 cols) in delta.dataSkippingStatsColumns,

[screenshot: ALTER TABLE ... SET TBLPROPERTIES setting delta.dataSkippingStatsColumns to all 42 columns]

and then run ANALYZE TABLE only for the new 10 columns (col33–col42):

[screenshot: ANALYZE TABLE ... COMPUTE DELTA STATISTICS FOR COLUMNS col33, ..., col42]

My Questions:

  • Will this keep the existing statistics for columns 1–32 unchanged and only compute statistics for columns 33–42?
  • Are the statistics for both the old (1–32) and new (33–42) columns maintained together in the same centralized metadata location, or are they managed separately?
  • Does this approach ensure that all 42 columns will have statistics available for data skipping without any redundant recomputation? (i.e., existing stats stay as-is, and only new stats are added)
  • After setting the dataSkipping property for all 42 columns, will future data loads automatically generate and maintain statistics for all 42 columns going forward?

Appreciate any insights and clarifications!

 

Hi @pooja_bhumandla ,

To answer your second set of questions directly:

1) Will running ANALYZE for only col33–col42 leave cols 1–32 unchanged?
Yes. With the table property set to include all 42 columns, you can run a targeted recomputation just for the new 10 columns:

   ANALYZE TABLE your_catalog.your_schema.your_table
   COMPUTE DELTA STATISTICS
   FOR COLUMNS col33, col34, col35, col36, col37, col38, col39, col40, col41, col42;
This recomputes Delta file-skipping stats only for the specified columns. Existing stats on cols 1–32 are left as-is.
 
2) Are stats for old and new columns kept together centrally or separately?
They’re maintained together in the Delta log at the file level (one set of file-level stats per file, covering the configured columns). There isn’t a separate store per column. The stats schema in the log enumerates which columns have statistics and stores their minimum, maximum, and null counts for each file.
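
If you’re curious what this looks like, one rough way to peek at it is to query the transaction log JSON directly (the path below is a placeholder, and this assumes you can read the table’s storage location):

   -- Each line of a commit file may carry an 'add' action; its 'stats' field is a
   -- JSON string holding per-file numRecords, minValues, maxValues and nullCount.
   SELECT add.path, add.stats
   FROM json.`/mnt/your_storage/your_table/_delta_log/00000000000000000010.json`
   WHERE add IS NOT NULL;
   -- add.stats looks roughly like this (values made up):
   -- {"numRecords":1000,"minValues":{"col1":1,"col33":"a"},"maxValues":{"col1":99,"col33":"z"},"nullCount":{"col1":0,"col33":3}}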
 
3) Does this avoid redundant recomputation?
Yes. Because you target only col33–col42 with FOR COLUMNS, only those columns are recomputed. Existing stats for cols 1–32 remain untouched, so you avoid recomputing work already done.
 
4) Will future data loads automatically maintain stats for all 42 columns?
Yes. After setting delta.dataSkippingStatsColumns (or raising delta.dataSkippingNumIndexedCols), future writes automatically collect file-skipping stats for the configured columns. Property changes affect future stats collection and don’t retroactively recompute for existing files.
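
And if you want to confirm the property is set so those future writes pick it up, a quick check could be (table name is a placeholder):

   SHOW TBLPROPERTIES your_catalog.your_schema.your_table ('delta.dataSkippingStatsColumns');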