Databricks Community

Henrik_ · ‎09-02-2024

On a Spark DataFrame, is there a smart way to set the order of a categorical feature explicitly, equivalent to Categorical(ordered=list) in Pandas? The use case here is a dashboard in Databricks, and I want the bars to be arranged in a certain order.

holly · ‎09-03-2024

Hi there, you can use a map function. Create a map with the creatively named create_map, and then sort by the values in the map.

The code will look something like this (I haven't tested it, so treat it as pseudocode):

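from pyspark.sql.functions import create_map, lit, col

categories = ['small', 'medium', 'large', 'xlarge']

# Build a literal map of category -> rank, i.e. map('small', 0, 'medium', 1, 'large', 2, 'xlarge', 3).
# Both keys and values are wrapped in lit(); a bare string would be read as a column name.
order_map = create_map([val for i, category in enumerate(categories) for val in (lit(category), lit(i))])

# Look up each row's rank in the map and sort by it
display(df.orderBy(order_map[col('category_col')]))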

Henrik_ · ‎09-03-2024

Thanks! One question: this code will order the whole dataframe based on the logic from create_map. However, I want to put several figures, each with its own sorting logic, on display in a dashboard. I don't think this method will work for that use case?

holly · ‎09-04-2024

Ah, I think I see. Let's say your dataset has category_col1 with values {S, M, L, XL} and category_col2 with values {XS, S, M}, and you want to sort the data by category_col1 and then category_col2.

If you want to specify the order for the user, you can repeat the create_map step to make map_1 and map_2, and then order by both columns. You can do this as part of your pipeline and save the results to your table, so the ordering is available outside the dataframe as well.
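A rough sketch of the two-map idea, not run against a real cluster. The helper names rank_lookup and with_rank_column and the column names rank_1 and rank_2 are made up for illustration. Each categorical column gets its own integer rank column, so a figure could sort on whichever rank column matches it, assuming the visualisation lets you choose a sort column.

```python
from typing import Dict, List


def rank_lookup(categories: List[str]) -> Dict[str, int]:
    """Map each category to its position in the desired order."""
    return {category: rank for rank, category in enumerate(categories)}


def with_rank_column(df, column: str, categories: List[str], rank_name: str):
    """Return df with an extra integer column holding each row's rank.

    Hypothetical helper: values not listed in `categories` are expected
    to end up with a null rank.
    """
    # Imported here so rank_lookup above can be used without Spark installed.
    from itertools import chain
    from pyspark.sql import functions as F

    lookup = rank_lookup(categories)
    # create_map takes alternating key and value columns.
    pairs = chain.from_iterable((F.lit(k), F.lit(v)) for k, v in lookup.items())
    order_map = F.create_map(*pairs)
    return df.withColumn(rank_name, order_map[F.col(column)])


# Assumed usage: df, category_col1 and category_col2 already exist.
# ranked = with_rank_column(df, "category_col1", ["S", "M", "L", "XL"], "rank_1")
# ranked = with_rank_column(ranked, "category_col2", ["XS", "S", "M"], "rank_2")
# display(ranked.orderBy("rank_1", "rank_2"))
```

Keeping the ranks as ordinary columns means they can be written to the table alongside the data, rather than living only in one orderBy call.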

BUT

If you want the end user to be able to sort the final Databricks visualisation / table by clicking on values, that's something we don't have at the moment. I think it's a sensible ask, so I'll raise this with our BI team.