Skewness / Salting with countDistinct

KosmaS · 08-05-2024

Hey Everyone,

I'm running into data skew with the following aggregation:

from pyspark.sql.functions import count, countDistinct, lit

df = (source_df
    .unionByName(source_df.withColumn("region", lit("Country")))
    .groupBy("zip_code", "region", "device_type")
    .agg(countDistinct("device_id").alias("total_active_unique"),
         count("device_id").alias("total_active")))

Stats:

Is there a way to handle data skew when the aggregation includes a countDistinct, without changing the results?

I understand how to handle skew by salting, and that works fine with count.
But once countDistinct comes into the picture, salting changes the results.
Is there some trick that still lets me apply salting and get correct, deterministic results for countDistinct?
Or is there another approach for handling skew in this case?