Re: Skewness / Salting with countDistinct

Retired_mod · 08-07-2024

Hi @KosmaS, to address data skew with `countDistinct`, you can use several techniques:

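Double Aggregation splits each hot key into sub-groups by adding a salt column, aggregates per (key, salt), then drops the salt and aggregates again per key. For a distinct count, the salt has to be derived from the value being counted (for example, a hash of that column modulo the number of salts), so that equal values always land in the same sub-group and the partial distinct counts can simply be summed. With a purely random salt, a value that lands in several sub-groups would be counted more than once.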
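
As a rough illustration of the two stages, here is a minimal sketch in plain Python rather than Spark, so it runs without a cluster. The names `NUM_SALTS`, `salt_for` and `salted_count_distinct` are made up for this example, and the CRC32-based salt is just one assumed way to get a deterministic hash. In Spark the same shape would be a `groupBy` on the key plus salt with `countDistinct`, followed by a `groupBy` on the key with `sum`.

```python
import zlib
from collections import defaultdict

NUM_SALTS = 8  # assumed number of salt buckets per key; tune for your data


def salt_for(value, num_salts=NUM_SALTS):
    # Deterministic salt derived from the counted value, so equal values
    # always fall into the same (key, salt) sub-group.
    return zlib.crc32(str(value).encode("utf-8")) % num_salts


def salted_count_distinct(rows, num_salts=NUM_SALTS):
    # Stage 1: aggregate per (key, salt), keeping the distinct values
    # seen in each sub-group.
    partial = defaultdict(set)
    for key, value in rows:
        partial[(key, salt_for(value, num_salts))].add(value)

    # Stage 2: drop the salt and sum the partial distinct counts per key.
    # Summing is valid because no value can appear under two salts.
    totals = defaultdict(int)
    for (key, _salt), values in partial.items():
        totals[key] += len(values)
    return dict(totals)


# A skewed input: one "hot" key with many rows, one "cold" key with few.
rows = [("hot", i % 1000) for i in range(10_000)]
rows += [("cold", 1), ("cold", 2), ("cold", 2)]

result = salted_count_distinct(rows)
print(result)
```

The work for the hot key is spread over up to `NUM_SALTS` sub-groups in the first stage, and the second stage only has to add up a handful of small integers per key.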

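HyperLogLog (HLL) provides approximate results in place of an exact `countDistinct`, trading a small, bounded error for much lower memory use and shuffle volume. In Spark this is available as `approx_count_distinct`, which takes an optional maximum relative standard deviation (`rsd`).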
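
To show why HLL copes well with skew, here is a small, simplified HyperLogLog in plain Python. It is a teaching sketch and not Spark's implementation: the register count (`P = 10`), the SHA-1 based hash and all function names are assumptions chosen for this example. The point it illustrates is that sketches built independently on different slices of a hot key can be merged by taking the register-wise maximum, so the slices can be split any way you like, including with a random salt.

```python
import hashlib
import math

P = 10          # assumed number of index bits
M = 1 << P      # number of registers (1024)


def _hash64(value):
    # 64-bit hash taken from the first 8 bytes of SHA-1 (assumed choice).
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def new_sketch():
    return [0] * M


def add(sketch, value):
    h = _hash64(value)
    idx = h >> (64 - P)                       # top P bits select a register
    rest = h & ((1 << (64 - P)) - 1)          # remaining 64 - P bits
    # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
    rank = (64 - P) - rest.bit_length() + 1
    if rank > sketch[idx]:
        sketch[idx] = rank


def merge(a, b):
    # The merged sketch describes the union of both inputs.
    return [max(x, y) for x, y in zip(a, b)]


def estimate(sketch):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        # Small-range correction (linear counting).
        return M * math.log(M / zeros)
    return raw


# 60,000 rows for one hot key, containing 20,000 distinct values.
values = [i % 20_000 for i in range(60_000)]

# Split the rows across 4 slices by position, as a random salt might.
partials = [new_sketch() for _ in range(4)]
for pos, v in enumerate(values):
    add(partials[pos % 4], v)

merged = partials[0]
for s in partials[1:]:
    merged = merge(merged, s)

# For comparison: one sketch built over all rows at once.
single = new_sketch()
for v in values:
    add(single, v)

approx = estimate(merged)
print(round(approx))
```

With 1024 registers the expected relative error is roughly 3%, and the merged sketch is identical to the one built over all rows in a single pass.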

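Broadcast Joins can help when the skew shows up in a join and one side is small enough to broadcast, since the large, skewed side then does not need to be shuffled.

Partitioning the data so that rows for the skewed keys are spread over more partitions (for example, repartitioning on the key together with a salt column) can distribute the load more evenly.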

If you have any more details or specific constraints, feel free to share!