Hey @Retired_mod
thanks for the reply. I tried to spend some time on your response.
You're suggesting 'double aggregation' and as I'd be guessing it should look more or less this way:
df = (source_df
.unionByName(source_df.withColumn("region", lit("Country")))
.groupBy("zip_code", "region", "device_type", "salt")
.agg(countDistinct("device_id").alias("total_active_unique"), count("device_id").alias("total_active"))
.groupBy("zip_code", "region", "device_type")
.agg(countDistinct("device_id").alias("total_active_unique"), count("device_id").alias("total_active"))
So I can't see how countDistinct value won't be affected by salt. It'll be affected at the first step (with salt), so the second step will have inaccurate results. Or should this be done a bit differently? And did you mean something else?