Show all distinct values per column in a DataFrame
Problem Statement:
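I want to see all the distinct values in each column of my table, but writing a SQL query with a collect_set() call for every column by hand is tedious and does not adapt when the columns change.
The code below builds one collect_set() aggregation per column from df.columns, so it works for any DataFrame. The result is a single row with one array of distinct values per column (collect_set() ignores nulls), as in the output below: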
%python
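from pyspark.sql.functions import col, collect_set
# One collect_set() per column, aliased back to the original column name
distincts = df.agg(*(collect_set(col(c)).alias(c) for c in df.columns))
# display() is the Databricks notebook renderer
distincts.display()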
![collect set table](/t5/image/serverpage/image-id/2405i74B6A77D665D7593/image-size/large?v=v2&px=999)
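
If you would rather stay in SQL, the same idea can be used to generate the query text instead of typing it out. This is only a minimal sketch: `build_distinct_query` is a hypothetical helper name, `my_table` is a placeholder table name, and the code assumes column names can be safely backtick-quoted.

```python
def build_distinct_query(columns, table_name):
    """Build a SQL string with one collect_set() per column (hypothetical helper)."""

    def quote(name):
        # Backtick-quote the identifier so names with spaces or reserved words
        # still parse; a backtick inside the name is escaped by doubling it.
        return "`" + name.replace("`", "``") + "`"

    select_list = ", ".join(
        f"collect_set({quote(c)}) AS {quote(c)}" for c in columns
    )
    # table_name is inserted as given, since it may be qualified (schema.table).
    return f"SELECT {select_list} FROM {table_name}"


# Example with two made-up column names and a placeholder table name.
query = build_distinct_query(["id", "country"], "my_table")
print(query)
```

In a notebook, something like `spark.sql(build_distinct_query(df.columns, "my_table")).display()` should then give the same kind of result, assuming the table is registered under that name.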