Options
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
05-27-2025 11:21 AM
The two commands in Databricks—
ANALYZE TABLE <table_name> COMPUTE STATISTICS and ANALYZE TABLE <table_name> COMPUTE DELTA STATISTICS—serve different purposes and have distinct characteristics:-
ANALYZE TABLE <table_name> COMPUTE STATISTICS:- This command collects table-level statistics, such as the number of rows and the size in bytes. It can also collect column-level statistics such as minimum, maximum, number of nulls, and distinct values for specified or all columns.
- These statistics are utilized by the query optimizer to generate an optimal query plan, which helps improve query performance and execution.
- Column statistics are particularly useful for cost-based optimization (CBO) scenarios, helping with join ordering and other query plan optimizations.
- Recommended Usage: It is generally advised to run this command after significant data modifications (e.g., deletes, updates, or merges) or when 10% or more of the data has changed.
-
ANALYZE TABLE <table_name> COMPUTE DELTA STATISTICS:- This command recomputes file-level statistics stored in the Delta log, which are used for features like data skipping. Data skipping helps query performance by filtering out unnecessary files during execution based on column-level statistics.
- When the
DELTAkeyword is specified, it does not generate regular table statistics for the query optimizer; its focus is primarily on Delta-specific enhancements. - This command may be triggered manually, especially after adjusting properties related to data skipping (e.g., specifying columns for statistics collection through
delta.dataSkippingStatsColumnsordelta.dataSkippingNumIndexedCols).
Differences in Functionality: -
COMPUTE STATISTICS enhances query optimization (CBO), whereas COMPUTE DELTA STATISTICS targets Delta-specific optimizations like file skipping. - COMPUTE DELTA STATISTICS focuses on updating metadata in the Delta log, while COMPUTE STATISTICS computes a broader set of metrics useful for generic query optimization.Recommendations for Running These Commands: - Running
COMPUTE STATISTICS weekly or whenever major data changes occur is a good practice for maintaining optimal query performance, as stale statistics can negatively impact query optimization. - COMPUTE DELTA STATISTICS may not need frequent manual triggering if predictive optimization (automatic stats collection) is enabled, though it can be run to update metadata after schema changes or configuration adjustments. - It is advisable to assess your workload and data mutation patterns to calibrate the frequency of running these commands, avoiding unnecessary resource overhead.These distinctions ensure that both commands are used effectively based on your desired optimization goals. If predictive optimization is enabled, you might not need to run these commands frequently, as stats collection is triggered automatically for Unity Catalog managed tables.