Data Engineering
how to get unique values of a column in pyspark dataframe

satya
New Contributor

Like in pandas, where I usually do df['columnname'].unique()

10 REPLIES

raela
New Contributor III

df.select("columnname").distinct().show()

AbhishekYada
New Contributor II

This code returns data that's not iterable, i.e. I see the distinct data but am not able to iterate over it in code. Is there any other way that enables me to do it? I tried using toPandas() to convert it into a Pandas df and then get the iterable with unique values. However, I'm running into a 'Pandas not found' error message. How can I install Pandas in my PySpark env, if my local machine already has Pandas installed?

If you just want to print the results and not use them for further processing, this is the way to go.

ShuminWu
New Contributor II

Hi, I tried using .distinct().show() as advised, but I'm getting the error TypeError: 'DataFrame' object is not callable.

The dataframe was read in from a CSV file using spark.read.csv, and other functions like describe work on the df. Any reason for this? How should I go about retrieving the list of unique values in this case?

Sorry if the question is very basic. I'm a noob at this. Thanks!

Rodneyjoyce
New Contributor III

To get the count of the distinct values:

df.select(F.countDistinct("colx")).show()

Or to count the number of records for each distinct value:

df.groupBy("colx").count().orderBy("count", ascending=False).show()

Thanks for this. The latter worked well for me. However, sorry for my ignorance here, but what is F in the first one? The code works without the F.

ldfo
New Contributor II

Hi, this worked for me.

distinct_ids = [x.id for x in data.select('id').distinct().collect()]

Ger_Martinez
New Contributor II

Nice approach, and very "pythonic" too.

If you want to use the results for further data processing, this is the way to go.
