Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Clarification Needed: COMPUTE STATISTICS vs COMPUTE DELTA STATISTICS on Delta Tables

Sainath368
New Contributor III

Hi everyone,

I’m trying to understand the difference between the two commands in Databricks:

  • ANALYZE TABLE <table_name> COMPUTE STATISTICS

  • ANALYZE TABLE <table_name> COMPUTE DELTA STATISTICS

Specifically:

  1. What exactly does each command do, and how do they differ, especially in the context of Delta Lake tables?

  2. How often is it recommended to run COMPUTE STATISTICS for optimal query performance and maintenance? For example, is running it every week a good practice, or is that unnecessary overhead?

Thanks in advance for any insights or recommendations!

6 REPLIES

BigRoux
Databricks Employee
The two commands in Databricks—ANALYZE TABLE <table_name> COMPUTE STATISTICS and ANALYZE TABLE <table_name> COMPUTE DELTA STATISTICS—serve different purposes and have distinct characteristics:
  1. ANALYZE TABLE <table_name> COMPUTE STATISTICS:
    • This command collects table-level statistics, such as the number of rows and the size in bytes. It can also collect column-level statistics such as minimum, maximum, number of nulls, and distinct values for specified or all columns.
    • These statistics are utilized by the query optimizer to generate an optimal query plan, which helps improve query performance and execution.
    • Column statistics are particularly useful for cost-based optimization (CBO) scenarios, helping with join ordering and other query plan optimizations.
    • Recommended Usage: It is generally advised to run this command after significant data modifications (e.g., deletes, updates, or merges) or when 10% or more of the data has changed.
  2. ANALYZE TABLE <table_name> COMPUTE DELTA STATISTICS:
    • This command recomputes file-level statistics stored in the Delta log, which are used for features like data skipping. Data skipping helps query performance by filtering out unnecessary files during execution based on column-level statistics.
    • When the DELTA keyword is specified, it does not generate regular table statistics for the query optimizer; its focus is primarily on Delta-specific enhancements.
    • This command may be triggered manually, especially after adjusting properties related to data skipping (e.g., specifying columns for statistics collection through delta.dataSkippingStatsColumns or delta.dataSkippingNumIndexedCols).
Differences in Functionality:
- COMPUTE STATISTICS enhances query optimization (CBO), whereas COMPUTE DELTA STATISTICS targets Delta-specific optimizations like file skipping.
- COMPUTE DELTA STATISTICS focuses on updating metadata in the Delta log, while COMPUTE STATISTICS computes a broader set of metrics useful for general query optimization.
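To make the distinction concrete, here is a minimal sketch of both commands; the table and column names are placeholders, not from the original question:

```sql
-- Table-level and column-level statistics for the cost-based optimizer (CBO).
-- These land in the catalog/metastore, not in the Delta transaction log.
ANALYZE TABLE sales.orders COMPUTE STATISTICS;
ANALYZE TABLE sales.orders COMPUTE STATISTICS FOR COLUMNS order_date, customer_id;

-- File-level min/max/null-count statistics written to the Delta log,
-- used for data skipping rather than for the CBO.
ANALYZE TABLE sales.orders COMPUTE DELTA STATISTICS;
```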
 
Recommendations for Running These Commands:
- Running COMPUTE STATISTICS weekly, or whenever major data changes occur, is a good practice for maintaining optimal query performance, as stale statistics can negatively impact query optimization.
- COMPUTE DELTA STATISTICS may not need frequent manual triggering if predictive optimization (automatic stats collection) is enabled, though it can be run to update metadata after schema changes or configuration adjustments.
- It is advisable to assess your workload and data mutation patterns to calibrate the frequency of running these commands, avoiding unnecessary resource overhead.
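As a sketch of the manual trigger mentioned above (again with placeholder names): after changing which columns collect file-level statistics, recomputing Delta statistics applies the new configuration to existing files, not just newly written ones.

```sql
-- Limit file-level stats collection to the columns you actually filter on.
ALTER TABLE sales.orders
  SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'order_date,customer_id');

-- Recompute file-level statistics so existing files reflect the new property.
ANALYZE TABLE sales.orders COMPUTE DELTA STATISTICS;
```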
 
These distinctions ensure that both commands are used effectively based on your desired optimization goals. If predictive optimization is enabled, you might not need to run these commands frequently, as stats collection is triggered automatically for Unity Catalog managed tables.

Sainath368
New Contributor III

I ran ANALYZE TABLE <table_name> COMPUTE DELTA STATISTICS on my tables to calculate column-level statistics at the file level for each Parquet file inside the Delta table. Now I’m wondering which command is better to run next: ANALYZE TABLE <table_name> COMPUTE STATISTICS or ANALYZE TABLE <table_name> COMPUTE STATISTICS FOR ALL COLUMNS? Since I already have column-level stats computed via Delta statistics, I want to understand the difference, what the query optimizer actually uses, and how it works.


BigRoux
Databricks Employee

Delta statistics are mostly used by the Delta protocol to do its thing. ANALYZE TABLE ... is going to be extremely beneficial to Spark and the Catalyst Optimizer. You need both. Delta stats are automatic. ANALYZE TABLE is run manually, or, if you are using Unity Catalog with managed tables, it is automatic there too. See below.

 

For Unity Catalog managed tables in Databricks, compute statistics is handled automatically. This automation is part of Databricks’ predictive optimization feature, which has been enabled by default for Unity Catalog managed tables across all accounts since July 2025.

Key points:

- Automatic statistics collection: Unity Catalog managed tables automatically gather essential statistics, such as minimum and maximum values for columns. This enables efficient data skipping and join strategies, improving query performance and reducing computational overhead.
- Predictive optimization: Predictive optimization automatically runs `ANALYZE` (i.e., compute statistics) on Unity Catalog managed tables. This means you do not need to manually run `ANALYZE TABLE ... COMPUTE STATISTICS` for these tables unless you want to force an update or target specific columns.
- When statistics are collected: With predictive optimization enabled, statistics are collected automatically when data is written to a managed table. The system also identifies when maintenance operations, like statistics collection, are needed and runs them as necessary.
- Manual intervention: While automatic collection is the default, you can still use the `ANALYZE TABLE` command to manually compute or refresh statistics if required for specific scenarios.

**Summary table:**

| Table Type                    | Statistics Collection | Manual ANALYZE Needed?        |
|-------------------------------|-----------------------|-------------------------------|
| Unity Catalog Managed Table   | Automatic             | Optional (for manual refresh) |
| Unity Catalog External Table  | Partial/Manual        | Often required                |
| Legacy Hive Metastore Managed | Manual                | Required                      |

Conclusion: For Unity Catalog managed tables, Databricks automatically computes statistics as part of its built-in optimization features, so manual intervention is generally unnecessary unless you have a specific need.
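If you want to verify what has been collected, one way (a sketch, with a placeholder table name) is to inspect the table's metadata from SQL:

```sql
-- Catalog-level view: includes a "Statistics" row (size in bytes, row count)
-- once ANALYZE TABLE ... COMPUTE STATISTICS has populated it.
DESCRIBE EXTENDED sales.orders;

-- Delta-level view: numFiles, sizeInBytes, and table properties such as
-- delta.dataSkippingStatsColumns.
DESCRIBE DETAIL sales.orders;
```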

 

Hope this helps. Lou.

 

(Embedded talk: Unity Catalog managed tables, and how they provide lightning-fast performance and hands-free management while remaining completely open and interoperable. Speakers: Cindy Jiang, Associate Product Manager, Databricks; Sirui Sun, Sr. Staff Product Manager.)

Sainath368
New Contributor III

Thank you for your earlier reply—which was helpful—but I’m still looking for a clearer explanation, so I’m asking this in more detail.

I'm working with external Delta tables that don't use predictive optimization, so I set the delta.dataSkippingStatsColumns property to include only the relevant columns and ran ANALYZE TABLE <table_name> COMPUTE DELTA STATISTICS to enable data skipping for those columns. While exploring further ways to improve query performance, I found that it's recommended to run ANALYZE TABLE <table_name> COMPUTE STATISTICS periodically, especially after refreshing a significant portion of the data. I tried this on one table and noticed that it ran very quickly, returning only the row count and table size. This led me to wonder: how does this lightweight command actually help the query optimizer, how does the optimizer work, and in what ways does it improve performance? Additionally, does running ANALYZE TABLE <table_name> COMPUTE STATISTICS FOR COLUMNS col1, col2, col3... (targeting the same columns used for data skipping) provide meaningful benefits over the plain version?

BigRoux
Databricks Employee

If you want to know more about query optimization, I suggest you look into Spark's Catalyst Optimizer and Adaptive Query Execution (AQE). You should always run ANALYZE TABLE ... COMPUTE STATISTICS because this helps Spark's query optimizer converge on the right execution plan via an optimized logical plan and then an optimized physical plan. There is plenty of information out there if you really want to dig in and learn. But the short of it is that Delta statistics and Spark's query optimizer work hand in hand, while some parts of Delta statistics relate only to Delta. Hope this helps. Lou.
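One way to see the effect yourself (a sketch, with placeholder names): EXPLAIN COST annotates the optimized logical plan with the optimizer's size and row-count estimates, which come from the collected statistics, so you can compare plans before and after running ANALYZE.

```sql
-- Collect column-level stats for the CBO, then inspect the plan.
ANALYZE TABLE sales.orders COMPUTE STATISTICS FOR COLUMNS order_date;

-- COST prints size/row estimates alongside the optimized logical plan.
EXPLAIN COST
SELECT * FROM sales.orders WHERE order_date >= '2025-01-01';
```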
