
How to get unique values of a column in a PySpark DataFrame

satya
New Contributor

In pandas I usually do df['columnname'].unique().

10 REPLIES

raela
Databricks Employee

df.select("columnname").distinct().show()

AbhishekYada
New Contributor II

This code returns data that is not iterable: I can see the distinct values but cannot iterate over them in code. Is there another way to do that? I tried toPandas() to convert the result to a pandas DataFrame and get an iterable of unique values from that, but I get a 'Pandas not found' error. How can I install pandas in my PySpark environment when pandas already runs on my local machine?

If you just want to print the results and not use the results for other processing, this is the way to go.

ShuminWu
New Contributor II

Hi, I tried using .distinct().show() as advised, but I am getting the error TypeError: 'DataFrame' object is not callable.

The DataFrame was read from a CSV file using spark.read.csv, and other functions such as describe work on it. Is there any reason for this? How should I go about retrieving the list of unique values in this case?

Sorry if the question is very basic; I am new to this. Thanks!

Rodneyjoyce
New Contributor III

To get the count of the distinct values:

df.select(F.countDistinct("colx")).show()

Or to count the number of records for each distinct value:

df.groupBy("colx").count().orderBy("colx").show()

Thanks for this. The latter worked well for me. However, sorry for my ignorance, but what is F in the first one? The code works without the F.

ldfo
New Contributor II

Hi, this worked for me.

distinct_ids = [x.id for x in data.select('id').distinct().collect()]

Ger_Martinez
New Contributor II

Nice approach, and very "Pythonic" too.

If you want to use the values for further processing, this is the way to go.
