How to get unique values of a column in a PySpark DataFrame
09-08-2016 12:01 AM
In pandas I usually do df['columnname'].unique(). What is the equivalent in PySpark?
04-04-2017 05:07 PM
df.select("columnname").distinct().show()
12-16-2018 05:14 PM
This code returns data that's not iterable, i.e. I can see the distinct values but am not able to iterate over them in code. Is there another way that lets me do that? I tried using toPandas() to convert the result into a pandas DataFrame and then get an iterable of the unique values. However, I'm running into a 'Pandas not found' error message. How can I install pandas in my PySpark environment, given that my local machine already has pandas running?
08-06-2021 10:45 PM
If you just want to print the results and not use the results for other processing, this is the way to go.
06-14-2017 08:14 PM
Hi, I tried using .distinct().show() as advised, but I am getting the error TypeError: 'DataFrame' object is not callable.
The DataFrame was read in from a CSV file using spark.read.csv, and other functions like describe work on the df. Is there any reason for this? How should I go about retrieving the list of unique values in this case?
Sorry if the question is very basic; I'm new to this. Thanks!
04-22-2019 04:18 AM
To get the count of the distinct values:
df.select(F.countDistinct("colx")).show()
Or to count the number of records for each distinct value:
df.groupBy("colx").count().orderBy("count").show()
11-22-2020 01:36 PM
Thanks for this. The latter worked well for me. However, sorry for my ignorance here, but what is F in the first one? The code works without the F.
07-01-2020 03:05 PM
Hi, this worked for me.
distinct_ids = [x.id for x in data.select('id').distinct().collect()]
09-02-2020 08:03 AM
Nice approach, and very "Pythonic" too.
08-06-2021 10:43 PM
If you want to use the values for further processing, this is the way to go.
08-06-2021 10:46 PM
If you want to use the results for some other data processing, this is the way to go.

