
Best Practice for Updating Data Skipping Statistics for Additional Columns

pooja_bhumandla
New Contributor III

Hi Community,
I have a scenario where I've already calculated Delta statistics for the first 32 columns after enabling the data skipping property. Now I need to include 10 more frequently used columns that were not part of the original 32.

Goal:
I want a robust way to calculate statistics for the 10 new columns alongside the existing 32, without unnecessary recomputation or fragmented maintenance.

Here's the challenge:
1. If I alter the data skipping property to cover all 42 columns (32 existing + 10 new) and recompute statistics for all tables, the statistics for the existing 32 columns are recomputed unnecessarily.

What's the recommended approach or best practice here?

Appreciate any insights!


szymon_dybczak
Esteemed Contributor III

Hi @pooja_bhumandla ,

Updating either of the two properties below does not automatically recompute statistics for existing data. It only affects how statistics are collected when data is added to or updated in the table afterwards.

- delta.dataSkippingNumIndexedCols

- delta.dataSkippingStatsColumns
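
As a rough sketch, either property can be set with ALTER TABLE ... SET TBLPROPERTIES. The table name my_table and the column names are placeholders, not taken from the question, and the two statements are alternatives rather than steps to run together. If both properties are set, delta.dataSkippingStatsColumns takes precedence.

```sql
-- Option A: collect stats on the first 42 columns instead of the default 32.
-- Assumes the 10 extra columns fall within the first 42 columns of the schema.
ALTER TABLE my_table
SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '42');

-- Option B: name the stats columns explicitly (placeholder column names;
-- the real list would contain all 42 required columns).
ALTER TABLE my_table
SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'col_a,col_b,col_c');
```

Neither statement touches statistics already written for existing files; it only changes what is collected on later writes.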

In Databricks Runtime 14.3 LTS and above, if you have altered the table properties or changed the specified columns for statistics, you can manually trigger the recomputation of statistics for a Delta table using the following command:

ANALYZE TABLE table_name COMPUTE DELTA STATISTICS

But you don't have to recompute statistics for all columns. This command supports listing the columns for which you'd like to refresh stats.
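
A hedged sketch of the column-scoped form, assuming the FOR COLUMNS clause of ANALYZE TABLE can be combined with the DELTA keyword as described above. The names my_table, new_col_1 and new_col_2 are placeholders.

```sql
-- Assumed syntax: refresh Delta statistics only for the newly added columns.
-- The listed columns should already be covered by the table's data skipping
-- properties, otherwise no statistics are collected for them.
ANALYZE TABLE my_table
COMPUTE DELTA STATISTICS FOR COLUMNS new_col_1, new_col_2;
```

It is worth checking the ANALYZE TABLE reference for your runtime version before relying on this form.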


So, either raise delta.dataSkippingNumIndexedCols above the default 32-column limit, or use delta.dataSkippingStatsColumns to list all of the required columns explicitly. In both cases, run ANALYZE TABLE manually afterwards.