Databricks Community

Sainath368 · ‎07-08-2025

I ran the command ANALYZE TABLE <TABLE_NAME> COMPUTE STATISTICS FOR COLUMNS 'COL1,COL2' to compute statistics for specific columns, but I noticed that the distinct counts returned by this command differ from the results obtained by running a direct COUNT(DISTINCT COL1) query. My question is does this ANALYZE TABLE <TABLE_NAME> COMPUTE STATISTICS FOR COLUMNS 'COL1,COL2' command calculate statistics accurately? If it does, what could explain the difference I’m seeing between the distinct counts from this command and the distinct counts from running a direct COUNT(DISTINCT COL1) query on the same table?

Here are the attached screenshots for reference:

describe extended <table_name> col1

select count(distinct(col1)) from <table_name>

sridharplv · ‎07-08-2025

HI @Sainath368 and @Waldar , Please find the below table which shows the difference between each queries which are causing the difference in counts and when to use which one.

Query Type Accuracy Based On Why It Differs

SELECT DISTINCT	Exact	✅	Full table scan	Real values
approx_count_distinct()	Approximate	~98%	Runtime estimation	Probabilistic algorithm
DESCRIBE EXTENDED	Approximate or Stale	❌	Cached metadata	Sampled/stale or not run at all

Purpose Use This

Data profiling / Reporting	SELECT COUNT(DISTINCT col)
Fast estimate	approx_count_distinct(col)
Metadata for query planning	ANALYZE TABLE + DESCRIBE EXTENDED

View solution in original post

Waldar · ‎07-09-2025

For larger tables, it's faster to compute the approximate count distinct rather than the exact value.

The difference of the actual number won't change the query plan - the optimizer usually choose by looking at order of magnitudes rather a specific threshold.

View solution in original post

Waldar · ‎07-08-2025

Hi Sainath368,

What would be the output of:

select approx_count_distinct(col) from <table_name>;

?

Sainath368 · ‎07-08-2025

Hi @Waldar ,
The approximate count matches the result from the DESCRIBE EXTENDED query. This confirms that the query you provided helps verify the count correctly. But what would be the reason?

Waldar · ‎07-09-2025

For larger tables, it's faster to compute the approximate count distinct rather than the exact value.

The difference of the actual number won't change the query plan - the optimizer usually choose by looking at order of magnitudes rather a specific threshold.

sridharplv · ‎07-08-2025

HI @Sainath368 and @Waldar , Please find the below table which shows the difference between each queries which are causing the difference in counts and when to use which one.

Query Type Accuracy Based On Why It Differs

SELECT DISTINCT	Exact	✅	Full table scan	Real values
approx_count_distinct()	Approximate	~98%	Runtime estimation	Probabilistic algorithm
DESCRIBE EXTENDED	Approximate or Stale	❌	Cached metadata	Sampled/stale or not run at all

Purpose Use This

Data profiling / Reporting	SELECT COUNT(DISTINCT col)
Fast estimate	approx_count_distinct(col)
Metadata for query planning	ANALYZE TABLE + DESCRIBE EXTENDED

Databricks Community

ANALYZE command for stats collection - distinct_count difference for a column

Join Us as a Local Community Builder!

🎬 Databricks Community 2025 Highlights | A Year, Built Together

🌟 Community Pulse: Your Weekly Roundup! December 22, 2025 – January 04, 2026

Solution Accelerator Series | Scale cybersecurity analytics with Splunk and Databricks

🎤 Call for Presentations: Data + AI Summit 2026 is Open!

Self-Paced Learning Festival: 09 January - 30 January 2026