ANALYZE TABLE showing NULLs for all statistics in Spark

chhavibansal
New Contributor III
// Load the Titanic CSV, inferring column types from the data
val df2 = spark.read
    .format("csv")
    .option("sep", ",")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("src/main/resources/datasets/titanic.csv")

// Register the DataFrame as a temp view and cache it
df2.createOrReplaceTempView("titanic")
spark.table("titanic").cache()

// Compute column statistics, then describe a single column
spark.sql("ANALYZE TABLE titanic COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("DESC EXTENDED titanic Name").show(100, false)

I have created a Spark session, loaded a dataset, and registered it as a temp view. After running the ANALYZE command, every statistic comes back as NULL for all columns.

+--------------+----------+
|info_name     |info_value|
+--------------+----------+
|col_name      |Name      |
|data_type     |string    |
|comment       |NULL      |
|min           |NULL      |
|max           |NULL      |
|num_nulls     |NULL      |
|distinct_count|NULL      |
|avg_col_len   |NULL      |
|max_col_len   |NULL      |
|histogram     |NULL      |
+--------------+----------+

Can someone suggest what I am doing wrong?

What I noticed is that if I create a new table:

spark.sql("create table newtitanic as select * from titanic")
spark.sql("Analyze table newtitanic compute statistics for all columns")
spark.sql("desc extended newtitanic Name").show(130, false)

then this returns statistics for all columns.
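The workaround above can be put together as a small standalone program. This is a minimal sketch rather than a confirmed fix: the object name `AnalyzePersistedTable`, the app name, the local master setting, and the `USING parquet` clause are assumptions added for illustration, and the CSV path from the question is assumed to exist. `USING parquet` is there so the statement creates a data source table and does not rely on Hive support being enabled.

```scala
import org.apache.spark.sql.SparkSession

object AnalyzePersistedTable {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session; on a cluster or notebook `spark` already exists.
    val spark = SparkSession.builder()
      .appName("analyze-persisted-table")
      .master("local[*]")
      .getOrCreate()

    // Same load as in the question.
    val df = spark.read
      .format("csv")
      .option("sep", ",")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("src/main/resources/datasets/titanic.csv")

    df.createOrReplaceTempView("titanic")

    // Persist the view's contents as a table, then analyze the table
    // instead of the temp view. `USING parquet` is an assumption (see lead-in).
    spark.sql("CREATE TABLE newtitanic USING parquet AS SELECT * FROM titanic")
    spark.sql("ANALYZE TABLE newtitanic COMPUTE STATISTICS FOR ALL COLUMNS")

    // Describe one column of the persisted table.
    spark.sql("DESC EXTENDED newtitanic Name").show(100, false)

    spark.stop()
  }
}
```

One possible reading, not confirmed anywhere in this thread, is that `DESC EXTENDED` reports statistics stored with the table's catalog metadata, which a temp view does not have.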

4 REPLIES

Aviral-Bhardwaj
Esteemed Contributor III

Hey,

I have tested this and it is working fine for me. Can you please share a link to the dataset so that we can test it and provide a better solution?

Here is a snapshot of the result I got:

image 

AviralBhardwaj

chhavibansal
New Contributor III

Hi @Aviral Bhardwaj

Thank you for the answer.

My question is more about using the ANALYZE TABLE command followed by DESCRIBE EXTENDED on the temp view that is created. You are using the right dataset, as shown in your screenshot. I have shared the full sequence of commands that leads to the NULL stats.

Aviral-Bhardwaj
Esteemed Contributor III

@Chhavi Bansal

It is happening because you are specifically describing the Name column.

See this:

image 

I hope this gives you some idea.

Thanks

Aviral Bhardwaj


chhavibansal
New Contributor III

Can you share what *newtitanic* is? I think you would have done something similar:

spark.sql("create table newtitanic as select * from titanic")

Something like this works for me, but the issue is that I first make a temp view and then have to create a table again, which would be persisted in memory.
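To check whether a name such as *newtitanic* refers to a temp view or a persisted table, the session catalog can be listed. The following is a minimal sketch, assuming a local SparkSession; `demo_view` and `demo_table` are hypothetical names used only for illustration, and `spark.range` stands in for the Titanic data.

```scala
import org.apache.spark.sql.SparkSession

object ListViewsAndTables {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session; on a cluster or notebook `spark` already exists.
    val spark = SparkSession.builder()
      .appName("list-views-and-tables")
      .master("local[*]")
      .getOrCreate()

    // Stand-in data (hypothetical; replaces the Titanic CSV for this sketch).
    val df = spark.range(5).toDF("id")

    // One temp view and one persisted table built from the same data.
    df.createOrReplaceTempView("demo_view")
    df.write.mode("overwrite").saveAsTable("demo_table")

    // The listing includes `tableType` and `isTemporary` columns,
    // which distinguish the temp view from the persisted table.
    spark.catalog.listTables().show(false)

    spark.stop()
  }
}
```

Running the same listing in the session from the question would show how `titanic` and `newtitanic` are registered.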
