05-27-2025 03:50 AM
Hi everyone,
I’m trying to understand the difference between the two commands in Databricks:
ANALYZE TABLE <table_name> COMPUTE STATISTICS
ANALYZE TABLE <table_name> COMPUTE DELTA STATISTICS
Specifically:
What exactly does each command do, and how do they differ, especially in the context of Delta Lake tables?
How often is it recommended to run COMPUTE STATISTICS for optimal query performance and maintenance? For example, is running it every week a good practice, or is that unnecessary overhead?
Thanks in advance for any insights or recommendations!
05-27-2025 11:21 AM
The two commands, `ANALYZE TABLE <table_name> COMPUTE STATISTICS` and `ANALYZE TABLE <table_name> COMPUTE DELTA STATISTICS`, serve different purposes and have distinct characteristics.

**ANALYZE TABLE <table_name> COMPUTE STATISTICS:**
- Collects table-level statistics (row count and table size) and, with `FOR COLUMNS` / `FOR ALL COLUMNS`, column-level statistics. These are stored in the catalog and feed Spark's cost-based optimizer (CBO).

**ANALYZE TABLE <table_name> COMPUTE DELTA STATISTICS:**
- When the `DELTA` keyword is specified, it does not generate regular table statistics for the query optimizer; its focus is primarily on Delta-specific enhancements. It recomputes the per-file statistics stored in the Delta transaction log, honoring the data-skipping configuration (`delta.dataSkippingStatsColumns` or `delta.dataSkippingNumIndexedCols`).

**Key differences:**
- `COMPUTE STATISTICS` enhances query optimization (CBO), whereas `COMPUTE DELTA STATISTICS` targets Delta-specific optimizations like file skipping.
- `COMPUTE DELTA STATISTICS` focuses on updating metadata in the Delta log, while `COMPUTE STATISTICS` computes a broader set of metrics useful for generic query optimization.

**How often to run them:**
- Running `COMPUTE STATISTICS` weekly, or whenever major data changes occur, is a good practice for maintaining optimal query performance, as stale statistics can negatively impact query optimization.
- `COMPUTE DELTA STATISTICS` may not need frequent manual triggering if predictive optimization (automatic stats collection) is enabled, though it can be run to update metadata after schema changes or configuration adjustments.
- It is advisable to assess your workload and data mutation patterns to calibrate the frequency of running these commands, avoiding unnecessary resource overhead.
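To make the distinction concrete, here is a minimal sketch (placeholder table and column names) of how each variant is invoked:

```sql
-- Table-level statistics (row count, size in bytes) for the cost-based optimizer;
-- add FOR COLUMNS to also collect column-level stats in the catalog.
ANALYZE TABLE my_table COMPUTE STATISTICS;
ANALYZE TABLE my_table COMPUTE STATISTICS FOR COLUMNS order_date, customer_id;

-- Recompute the per-file min/max/null-count statistics in the Delta transaction log,
-- honoring delta.dataSkippingStatsColumns / delta.dataSkippingNumIndexedCols.
ANALYZE TABLE my_table COMPUTE DELTA STATISTICS;

-- Inspect the statistics that COMPUTE STATISTICS wrote to the catalog.
DESCRIBE TABLE EXTENDED my_table;
```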
2 weeks ago
I ran ANALYZE TABLE <table_name> COMPUTE DELTA STATISTICS on my tables to calculate column-level statistics at the file level for each Parquet file inside the Delta table. Now, I’m wondering which command is better to run next: ANALYZE TABLE <table_name> COMPUTE STATISTICS or ANALYZE TABLE <table_name> COMPUTE STATISTICS FOR ALL COLUMNS? Since I already have column-level stats computed via Delta statistics, I want to understand the difference between the two, which statistics the query optimizer actually uses, and how it works.
2 weeks ago
The Delta statistics are mostly used by the Delta protocol to do its thing. ANALYZE TABLE ... is going to be extremely beneficial to Spark and the Catalyst Optimizer. You need both. Delta statistics are collected automatically. ANALYZE TABLE is done manually or, if you are using Unity Catalog with managed tables, it will be automatic there too. See below.
For Unity Catalog managed tables in Databricks, "compute statistics" is handled automatically. This automation is part of Databricks’ predictive optimization feature, which is enabled by default for Unity Catalog managed tables as of July 2025 for all accounts.
Key points:
- Automatic statistics collection: Unity Catalog managed tables automatically gather essential statistics, such as minimum and maximum values for columns. This enables efficient data skipping and join strategies, improving query performance and reducing computational overhead.
- Predictive optimization: Predictive optimization automatically runs `ANALYZE` (i.e., compute statistics) on Unity Catalog managed tables. This means you do not need to manually run `ANALYZE TABLE ... COMPUTE STATISTICS` for these tables unless you want to force an update or target specific columns.
- When statistics are collected: With predictive optimization enabled, statistics are collected automatically when data is written to a managed table. The system also identifies when maintenance operations, like statistics collection, are needed and runs them as necessary.
- Manual intervention: While automatic collection is the default, you can still use the `ANALYZE TABLE` command to manually compute or refresh statistics if required for specific scenarios.
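For anyone who wants to drive this from SQL, a minimal sketch (catalog, schema, and table names are placeholders, and the exact predictive-optimization syntax may differ slightly by Databricks release):

```sql
-- Enable predictive optimization explicitly at the schema level
-- (managed tables underneath inherit it).
ALTER SCHEMA main.sales ENABLE PREDICTIVE OPTIMIZATION;

-- External tables are not covered by predictive optimization,
-- so refresh their statistics manually after large data changes.
ANALYZE TABLE main.sales.orders_ext COMPUTE STATISTICS;

-- Table-level statistics (size in bytes, row count) show up in the table details.
DESCRIBE TABLE EXTENDED main.sales.orders_ext;
```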
**Summary table:**
| Table Type | Statistics Collection | Manual ANALYZE Needed? |
|-----------------------------------|----------------------|------------------------------|
| Unity Catalog Managed Table | Automatic | Optional (for manual refresh)|
| Unity Catalog External Table | Partial/Manual | Often required |
| Legacy Hive Metastore Managed | Manual | Required |
Conclusion: For Unity Catalog managed tables, Databricks automatically computes statistics as part of its built-in optimization features, so manual intervention is generally unnecessary unless you have a specific need.
Hope this helps. Lou.
a week ago
Thank you for your earlier reply—which was helpful—but I’m still looking for a clearer explanation, so I’m asking this in more detail.
I'm working with external Delta tables that don't use predictive optimization, so I set the delta.dataSkippingStatsColumns property to include only the relevant columns and ran ANALYZE TABLE <table_name> COMPUTE DELTA STATISTICS to enable data skipping for those columns. While exploring further ways to improve query performance, I found that it's recommended to run ANALYZE TABLE <table_name> COMPUTE STATISTICS periodically, especially after refreshing a significant portion of the data. I tried this on one table and noticed that it ran very quickly, returning only the row count and table size. This led me to wonder: how does this lightweight command actually help the query optimizer, how does the optimizer work, and in what ways does it improve performance? Additionally, does running ANALYZE TABLE <table_name> COMPUTE STATISTICS FOR COLUMNS col1, col2, col3... (targeting the same columns used for data skipping) provide meaningful benefits over the plain version?
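For concreteness, this is roughly the sequence described above, with placeholder table and column names:

```sql
-- Limit file-level data-skipping stats to the columns that matter for filtering.
ALTER TABLE events_ext SET TBLPROPERTIES (
  'delta.dataSkippingStatsColumns' = 'event_date, customer_id, region'
);

-- Rewrite the per-file min/max/null-count stats in the Delta log for those columns.
ANALYZE TABLE events_ext COMPUTE DELTA STATISTICS;

-- Separately, give the cost-based optimizer column-level stats (distinct counts,
-- null counts, min/max) for the columns used in joins and filters.
ANALYZE TABLE events_ext COMPUTE STATISTICS FOR COLUMNS event_date, customer_id, region;
```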
a week ago
If you want to know more about query optimization, I suggest you look into Spark's Catalyst Optimizer and Adaptive Query Execution (AQE). You should always run ANALYZE TABLE ... COMPUTE STATISTICS because this helps Spark's query optimization converge on the right execution plan via an optimized logical plan and then an optimized physical plan. There is plenty of information out there if you really want to dig in and learn. But the short of it is that Delta statistics and the Spark query optimizer do work hand in hand, though there are things about Delta statistics that only relate to Delta. Hope this helps. Lou.
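As an illustration of the pieces mentioned above, a minimal sketch (placeholder table names; the flags are standard Spark SQL configs) showing how to confirm the optimizer is picking up the collected statistics:

```sql
-- Cost-based optimization and join reordering use the statistics from ANALYZE TABLE.
SET spark.sql.cbo.enabled = true;
SET spark.sql.cbo.joinReorder.enabled = true;

-- EXPLAIN COST prints the optimized logical plan with row-count and size estimates,
-- which reflect the table/column statistics the optimizer found.
EXPLAIN COST
SELECT c.region, count(*) AS orders
FROM orders_ext AS o
JOIN customers_ext AS c ON o.customer_id = c.customer_id
GROUP BY c.region;
```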