Databricks Community

KosmaS · ‎08-05-2024

Hey Everyone,

I'm running into data skew with the following aggregation:

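from pyspark.sql.functions import count, countDistinct, lit

# The union appends a copy of every row with region overwritten to "Country".
df = (source_df
    .unionByName(source_df.withColumn("region", lit("Country")))
    .groupBy("zip_code", "region", "device_type")
    .agg(countDistinct("device_id").alias("total_active_unique"), count("device_id").alias("total_active"))
)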

Stats:

Is there a way to handle data skew when the aggregation includes a countDistinct, without changing the results?

I know salting is the usual way to deal with skew, and it works fine with count.
With countDistinct, though, salting seems to change the results.
Is there some trick that lets me keep the salting and still get correct, deterministic results for countDistinct?
Or is there another approach to data skew that fits this case?
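
The effect described in this question can be reproduced without Spark. What follows is a minimal pure-Python sketch, not code from the thread: the `events` list, `NUM_SALTS`, and the `salted_sums` helper are all made up for illustration, and a round-robin salt stands in for a random one. It aggregates per salt bucket and then adds the per-bucket results together, which is what a naive salted aggregation amounts to.

```python
from collections import defaultdict

# Hypothetical device_id values for the rows of ONE (zip_code, region, device_type) group.
events = ["d1", "d1", "d1", "d2", "d2", "d3"]
NUM_SALTS = 2  # illustrative bucket count


def salted_sums(device_ids, num_salts):
    """Aggregate per salt bucket, then add the per-bucket results together."""
    buckets = defaultdict(list)
    for row_index, device_id in enumerate(device_ids):
        # Salt chosen without looking at device_id (round-robin stands in for rand()).
        buckets[row_index % num_salts].append(device_id)
    total = sum(len(ids) for ids in buckets.values())
    unique = sum(len(set(ids)) for ids in buckets.values())
    return total, unique


total, unique = salted_sums(events, NUM_SALTS)
print(total, len(events))        # 6 6 -> count is unaffected by the salt
print(unique, len(set(events)))  # 5 3 -> the distinct count is inflated
```

Row counts add up across buckets because every row lands in exactly one bucket. Distinct counts do not, because the same device_id can land in several buckets and is then counted once per bucket.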

KosmaS · ‎08-22-2024

Hey @Retired_mod

Thanks for the reply. I spent some time working through your response.

You're suggesting a 'double aggregation', and my guess is that it should look more or less like this:

df = (source_df
    .unionByName(source_df.withColumn("region", lit("Country")))
    .groupBy("zip_code", "region", "device_type", "salt")
    .agg(countDistinct("device_id").alias("total_active_unique"), count("device_id").alias("total_active"))
    .groupBy("zip_code", "region", "device_type")
    .agg(countDistinct("device_id").alias("total_active_unique"), count("device_id").alias("total_active"))
)

I can't see how the countDistinct value would be unaffected by the salt. It is already distorted in the first step (the one grouped by salt), so the second step will produce inaccurate results. Should this be done differently, or did you mean something else?
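
One way a two-stage aggregation can stay exact is to derive the salt from device_id itself, so that every row of a given device lands in the same bucket and the per-bucket distinct counts can simply be summed. The sketch below models that idea in plain Python rather than Spark. The `salt_for` and `two_stage_counts` helpers, the `NUM_SALTS` value, and the sample rows are all hypothetical, and this is one reading of "double aggregation", not necessarily what the earlier reply had in mind.

```python
import zlib
from collections import defaultdict

NUM_SALTS = 4  # hypothetical bucket count


def salt_for(device_id, num_salts=NUM_SALTS):
    """Deterministic salt: the same device_id always maps to the same bucket."""
    return zlib.crc32(device_id.encode("utf-8")) % num_salts


def two_stage_counts(rows, num_salts=NUM_SALTS):
    """rows are (zip_code, region, device_type, device_id) tuples."""
    # Stage 1: group by (zip_code, region, device_type, salt).
    ids_per_bucket = defaultdict(set)
    rows_per_bucket = defaultdict(int)
    for zip_code, region, device_type, device_id in rows:
        key = (zip_code, region, device_type)
        salted_key = (key, salt_for(device_id, num_salts))
        ids_per_bucket[salted_key].add(device_id)
        rows_per_bucket[salted_key] += 1
    # Stage 2: drop the salt and SUM the partial results (no second distinct count).
    unique = defaultdict(int)
    total = defaultdict(int)
    for (key, salt), ids in ids_per_bucket.items():
        unique[key] += len(ids)
        total[key] += rows_per_bucket[(key, salt)]
    return {key: (unique[key], total[key]) for key in unique}


rows = (
    [("10001", "Country", "phone", "d1")] * 3
    + [("10001", "Country", "phone", "d2")] * 2
    + [("10001", "Country", "phone", "d3")]
    + [("10001", "Country", "tablet", "d1")]
)
print(two_stage_counts(rows))
```

Because a device_id can never be split across buckets, the buckets hold disjoint sets of devices and their sizes add up to the true distinct count for any number of salts. In Spark terms, the salt column would presumably be computed from device_id (for instance with pmod(hash("device_id"), n) from pyspark.sql.functions), and the second groupBy would apply sum to the partial counts instead of another countDistinct. Note that this only spreads a hot group across buckets when the skew comes from many different devices, not from a single device with a huge number of rows.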

Skewness / Salting with countDistinct
