07-08-2025 08:07 AM
I ran the command ANALYZE TABLE <TABLE_NAME> COMPUTE STATISTICS FOR COLUMNS 'COL1,COL2' to compute statistics for specific columns, but I noticed that the distinct counts returned by this command differ from the results obtained by running a direct COUNT(DISTINCT COL1) query. My question is does this ANALYZE TABLE <TABLE_NAME> COMPUTE STATISTICS FOR COLUMNS 'COL1,COL2' command calculate statistics accurately? If it does, what could explain the difference I’m seeing between the distinct counts from this command and the distinct counts from running a direct COUNT(DISTINCT COL1) query on the same table?
Here are the attached screenshots for reference:
describe extended <table_name> col1
select count(distinct(col1)) from <table_name>
07-08-2025 11:25 AM
HI @Sainath368 and @Waldar , Please find the below table which shows the difference between each queries which are causing the difference in counts and when to use which one.
Query Type Accuracy Based On Why It Differs
SELECT DISTINCT | Exact | ✅ | Full table scan | Real values |
approx_count_distinct() | Approximate | ~98% | Runtime estimation | Probabilistic algorithm |
DESCRIBE EXTENDED | Approximate or Stale | ❌ | Cached metadata | Sampled/stale or not run at all |
Purpose Use This
Data profiling / Reporting | SELECT COUNT(DISTINCT col) |
Fast estimate | approx_count_distinct(col) |
Metadata for query planning | ANALYZE TABLE + DESCRIBE EXTENDED |
07-09-2025 02:40 AM
For larger tables, it's faster to compute the approximate count distinct rather than the exact value.
The difference of the actual number won't change the query plan - the optimizer usually choose by looking at order of magnitudes rather a specific threshold.
07-08-2025 10:36 AM - edited 07-08-2025 10:36 AM
Hi Sainath368,
What would be the output of:
select approx_count_distinct(col) from <table_name>;
?
07-08-2025 11:51 PM
Hi @Waldar ,
The approximate count matches the result from the DESCRIBE EXTENDED query. This confirms that the query you provided helps verify the count correctly. But what would be the reason?
07-09-2025 02:40 AM
For larger tables, it's faster to compute the approximate count distinct rather than the exact value.
The difference of the actual number won't change the query plan - the optimizer usually choose by looking at order of magnitudes rather a specific threshold.
07-08-2025 11:25 AM
HI @Sainath368 and @Waldar , Please find the below table which shows the difference between each queries which are causing the difference in counts and when to use which one.
Query Type Accuracy Based On Why It Differs
SELECT DISTINCT | Exact | ✅ | Full table scan | Real values |
approx_count_distinct() | Approximate | ~98% | Runtime estimation | Probabilistic algorithm |
DESCRIBE EXTENDED | Approximate or Stale | ❌ | Cached metadata | Sampled/stale or not run at all |
Purpose Use This
Data profiling / Reporting | SELECT COUNT(DISTINCT col) |
Fast estimate | approx_count_distinct(col) |
Metadata for query planning | ANALYZE TABLE + DESCRIBE EXTENDED |
Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!
Sign Up Now