Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Data Skipping - Partitioned Tables

Sainath368
New Contributor III

Hi all,

I have a question: how can we modify delta.dataSkippingStatsColumns and compute statistics for a partitioned Delta table in Databricks? I want to understand the process and best practices for changing this setting and ensuring accurate statistics for partitioned data. Any guidance would be appreciated.

1 REPLY

paolajara
Databricks Employee

Hi, delta.dataSkippingStatsColumns specifies a comma-separated list of column names for which Delta Lake collects statistics. It can improve performance by limiting statistics collection to just those columns, superseding the default behavior of analyzing the first 32 columns of the table.
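For instance, the property can also be set at table creation time. A minimal sketch on a partitioned Delta table (the table and column names here are illustrative, not from the original question):

```sql
-- Hypothetical partitioned Delta table; statistics are collected only
-- on event_ts and amount rather than on the first 32 columns.
CREATE TABLE sales_events (
  event_id BIGINT,
  event_ts TIMESTAMP,
  country  STRING,
  amount   DOUBLE
)
USING DELTA
PARTITIONED BY (country)
TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'event_ts,amount');
```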

  1. Set the property when creating or modifying a table: 
    1. ALTER TABLE your_table_name
      SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'col1,col2,col3'); 
  2. On Databricks Runtime (DBR) 14.3 LTS and above, you can manually compute statistics for existing and future data:
    1. ANALYZE TABLE your_table_name COMPUTE DELTA STATISTICS;
  3. Avoid long string columns, since they are expensive to analyze. These should be excluded from delta.dataSkippingStatsColumns.
  4. The columns you specify for statistics should overlap with your filtering criteria, e.g. partition columns or high-cardinality columns used in query predicates. Statistics on unused or rarely filtered columns waste compute resources without significant benefit.
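Putting the steps above together, a sketch of the full sequence on an existing partitioned table (your_table_name is a placeholder; the column names are illustrative):

```sql
-- Change which columns Delta Lake collects statistics on. This affects
-- files written after the change; existing files keep their old stats.
ALTER TABLE your_table_name
  SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'event_ts,amount');

-- On DBR 14.3 LTS and above, recompute statistics for existing files
-- so they match the new column list.
ANALYZE TABLE your_table_name COMPUTE DELTA STATISTICS;

-- Confirm the property took effect.
SHOW TBLPROPERTIES your_table_name ('delta.dataSkippingStatsColumns');
```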
