cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

ANALYZE command for stats collection - distinct_count difference for a column

Sainath368
New Contributor III

I ran the command ANALYZE TABLE <TABLE_NAME> COMPUTE STATISTICS FOR COLUMNS 'COL1,COL2' to compute statistics for specific columns, but I noticed that the distinct counts returned by this command differ from the results obtained by running a direct COUNT(DISTINCT COL1) query.  My question is does this ANALYZE TABLE <TABLE_NAME> COMPUTE STATISTICS FOR COLUMNS 'COL1,COL2' command calculate statistics accurately? If it does, what could explain the difference I’m seeing between the distinct counts from this command and the distinct counts from running a direct COUNT(DISTINCT COL1) query on the same table?

Here are the attached screenshots for reference:

describe  extended <table_name> col1

Sainath368_2-1751986880556.png

select count(distinct(col1)) from <table_name>

Sainath368_3-1751986940034.png

 

 

 

2 ACCEPTED SOLUTIONS

Accepted Solutions

sridharplv
Valued Contributor II

HI @Sainath368 and @Waldar , Please find the below table which shows the difference between each queries which are causing the difference in counts and when to use which one.

sridharplv_0-1751999054634.png

 

 

Query Type Accuracy Based On Why It Differs

SELECT DISTINCTExactFull table scanReal values
approx_count_distinct()Approximate~98%Runtime estimationProbabilistic algorithm
DESCRIBE EXTENDEDApproximate or StaleCached metadataSampled/stale or not run at all

 

Purpose Use This

Data profiling / ReportingSELECT COUNT(DISTINCT col)
Fast estimateapprox_count_distinct(col)
Metadata for query planning

ANALYZE TABLE + DESCRIBE EXTENDED

 

View solution in original post

Waldar
Databricks Employee
Databricks Employee

For larger tables, it's faster to compute the approximate count distinct rather than the exact value.

The difference of the actual number won't change the query plan - the optimizer usually choose by looking at order of magnitudes rather a specific threshold.

View solution in original post

4 REPLIES 4

Waldar
Databricks Employee
Databricks Employee

Hi Sainath368,

What would be the output of:

select approx_count_distinct(col) from <table_name>;

 ?

Sainath368
New Contributor III

Hi @Waldar , 
The approximate count matches the result from the DESCRIBE EXTENDED query. This confirms that the query you provided helps verify the count correctly. But what would be the reason?

Sainath368_0-1752043574927.png

 



Waldar
Databricks Employee
Databricks Employee

For larger tables, it's faster to compute the approximate count distinct rather than the exact value.

The difference of the actual number won't change the query plan - the optimizer usually choose by looking at order of magnitudes rather a specific threshold.

sridharplv
Valued Contributor II

HI @Sainath368 and @Waldar , Please find the below table which shows the difference between each queries which are causing the difference in counts and when to use which one.

sridharplv_0-1751999054634.png

 

 

Query Type Accuracy Based On Why It Differs

SELECT DISTINCTExactFull table scanReal values
approx_count_distinct()Approximate~98%Runtime estimationProbabilistic algorithm
DESCRIBE EXTENDEDApproximate or StaleCached metadataSampled/stale or not run at all

 

Purpose Use This

Data profiling / ReportingSELECT COUNT(DISTINCT col)
Fast estimateapprox_count_distinct(col)
Metadata for query planning

ANALYZE TABLE + DESCRIBE EXTENDED

 

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!

Sign Up Now