08-07-2024 06:30 AM
Hi @KosmaS, to address data skew with `countDistinct`, you can try several techniques:
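Double aggregation: add a salt column, aggregate by key and salt, then drop the salt and aggregate the partial results by key alone. For a distinct count, derive the salt from the counted value (for example, a hash of the value modulo the number of salts) rather than at random, so that each distinct value lands in exactly one salt bucket and the partial counts can simply be summed.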
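Here is a minimal plain-Python sketch of that two-stage logic, just to show why the value-derived salt keeps the result exact. The names `salted_count_distinct` and `NUM_SALTS` are made up for illustration, and `zlib.crc32` stands in for whatever hash you would use. In a Spark job the same shape would be a `groupBy` on key and salt with `countDistinct`, followed by a `groupBy` on key that sums the partial counts.

```python
import zlib
from collections import defaultdict

NUM_SALTS = 8  # assumption: number of salt buckets per key, tune for your skew


def salted_count_distinct(rows, num_salts=NUM_SALTS):
    """Count distinct values per key in two stages.

    rows is an iterable of (key, value) pairs.
    """
    # Stage 1: group by (key, salt). The salt is a hash of the value, so
    # every occurrence of the same value ends up in the same bucket.
    stage1 = defaultdict(set)
    for key, value in rows:
        salt = zlib.crc32(str(value).encode("utf-8")) % num_salts
        stage1[(key, salt)].add(value)

    # Stage 2: drop the salt and sum the partial distinct counts per key.
    # Summing is safe because the buckets of one key never share a value.
    totals = defaultdict(int)
    for (key, _salt), values in stage1.items():
        totals[key] += len(values)
    return dict(totals)
```

With a random salt the same value could fall into several buckets of one key and be counted more than once, which is why the salt is computed from the value here.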
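HyperLogLog (HLL) trades exactness for speed: instead of `countDistinct` you can use `approx_count_distinct`, which returns an estimate within a configurable maximum relative standard deviation (the `rsd` parameter, 0.05 by default). A lower `rsd` gives a more accurate estimate at the cost of more memory.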
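To give a feel for why HLL copes well with skew, here is a simplified plain-Python sketch of the idea. It is not Spark's implementation (Spark uses a HyperLogLog++ variant); the register count `P`, the SHA-1 based hash, and all function names are assumptions chosen for illustration. The point to notice is that each sketch has a fixed size and two sketches merge with an element-wise max, so partial results for a hot key can be combined cheaply.

```python
import hashlib
import math

P = 10          # assumption: 10 index bits, so 2**10 = 1024 registers
M = 1 << P


def new_sketch():
    return [0] * M


def _hash64(value):
    # Illustrative 64-bit hash taken from the first 8 bytes of SHA-1.
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hll_add(registers, value):
    h = _hash64(value)
    idx = h >> (64 - P)                       # top P bits pick a register
    rest = h & ((1 << (64 - P)) - 1)          # remaining 64 - P bits
    # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
    rank = (64 - P) - rest.bit_length() + 1
    if rank > registers[idx]:
        registers[idx] = rank


def hll_merge(a, b):
    # Merging two sketches is an element-wise max of their registers.
    return [max(x, y) for x, y in zip(a, b)]


def hll_estimate(registers):
    alpha = 0.7213 / (1 + 1.079 / M)          # bias constant for M >= 128
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:
        # Small-range correction: fall back to linear counting.
        return M * math.log(M / zeros)
    return raw
```

With 1024 registers the expected standard error is roughly 1.04 / sqrt(1024), or about 3%.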
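Broadcast joins can help if the skew shows up in a join that feeds the aggregation: when one side is small enough to fit in memory on each executor, broadcasting it avoids shuffling the large, skewed side. Repartitioning can also spread the load more evenly, but partition on a combination of columns (for example, the skewed key together with the counted column) rather than on the skewed key alone, since partitioning by the hot key alone sends all of its rows to a single partition.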
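To see why the choice of partitioning columns matters, here is a small plain-Python model of hash partitioning. The helper `partition_sizes` is hypothetical and `zlib.crc32` is only a stand-in for Spark's own partitioner hash; the sketch just counts how many rows land in each partition for a given partitioning expression.

```python
import zlib
from collections import Counter


def partition_sizes(rows, num_partitions, key_fn):
    """Return a Counter mapping partition id to row count.

    key_fn maps a row to the string that gets hashed, which models the
    column or columns you would repartition on.
    """
    sizes = Counter()
    for row in rows:
        part = zlib.crc32(key_fn(row).encode("utf-8")) % num_partitions
        sizes[part] += 1
    return sizes


# One hot key with 1000 rows plus 100 ordinary keys with one row each.
rows = [("hot", f"v{i}") for i in range(1000)]
rows += [(f"k{i}", "x") for i in range(100)]

# Partitioning by key alone puts every "hot" row in the same partition.
by_key = partition_sizes(rows, 8, lambda r: r[0])

# Partitioning by (key, value) spreads the hot key across partitions.
by_key_value = partition_sizes(rows, 8, lambda r: f"{r[0]}|{r[1]}")
```

Comparing `max(by_key.values())` with `max(by_key_value.values())` shows how much of the hot key's load moves off the largest partition.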
If you have any more details or specific constraints, feel free to share!