03-19-2026 11:43 PM
At petabyte scale, the most common performance "killer" is data skew. It happens when one join key (such as a null value or a "power user" ID) has millions of rows while the others have only a few.
Even with Spark’s optimizations, the executor that receives the hot key gets buried while the others sit idle. The solution is salting.
1. The Problem: Standard Join
In a standard shuffle join, Spark hashes the join key to decide which partition each row goes to. If ID 101 appears 1 billion times, all of those rows land in the same partition.
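To make the effect concrete, here is a minimal plain-Python sketch of key-based partitioning. It is not Spark code: `partition_sizes` is a hypothetical helper, and `key % num_partitions` is only a stand-in for Spark's real hash function. The row counts are scaled down for illustration.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count rows per partition when each row is routed by its key alone.

    `key % num_partitions` is a stand-in for a real hash partitioner;
    the point is only that equal keys always map to the same partition.
    """
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


# 1,000 rows for the hot key 101, plus one row each for keys 1..100.
keys = [101] * 1000 + list(range(1, 101))
sizes = partition_sizes(keys, num_partitions=8)

hot_partition = 101 % 8  # every row for key 101 is routed here
print(sizes)
print("rows in hot partition:", sizes[hot_partition])
```

Every row for key 101 ends up in one partition no matter how many partitions you configure, so adding partitions alone does not help.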
2. The Fix: Salting Technique
We manually break the "heavy" key into smaller pieces by appending a random "salt" (a suffix), so that its rows are spread across several partitions instead of one.
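The idea can be sketched in plain Python, again without Spark. This is an illustration under stated assumptions, not a definitive implementation: `NUM_SALTS`, `salt_large_side`, `explode_small_side`, and `hash_join` are hypothetical names, and the data is made up. One detail worth noting is that the other side of the join has to be replicated once per salt value, otherwise the salted keys would have nothing to match.

```python
import random
from collections import Counter, defaultdict

NUM_SALTS = 4  # hypothetical salt count; tune to the degree of skew


def salt_large_side(rows, num_salts, rng):
    """Attach a random salt to every key on the skewed (large) side."""
    return [((key, rng.randrange(num_salts)), value) for key, value in rows]


def explode_small_side(rows, num_salts):
    """Replicate each small-side row once per salt so every salted key matches."""
    return [((key, salt), value) for key, value in rows for salt in range(num_salts)]


def hash_join(left, right):
    """Plain in-memory equi-join on the first element of each row."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, ())]


# Made-up data: key 101 is the hot key, key 7 is an ordinary one.
events = [(101, f"event-{i}") for i in range(1000)] + [(7, "event-x")]
users = [(101, "power-user"), (7, "casual-user")]

rng = random.Random(0)  # fixed seed so the run is repeatable
salted_events = salt_large_side(events, NUM_SALTS, rng)
salted_users = explode_small_side(users, NUM_SALTS)

plain = hash_join(events, users)
salted = hash_join(salted_events, salted_users)

# Drop the salt again; the result should match the unsalted join.
unsalted = [(key, lv, rv) for (key, _salt), lv, rv in salted]

# How many rows each (key, salt) pair of the hot key received.
hot_key_spread = Counter(k for k, _ in salted_events if k[0] == 101)
print(dict(hot_key_spread))
```

The join result is unchanged, but the hot key is now split into up to `NUM_SALTS` sub-keys that can be processed in parallel. The cost is that the smaller side grows by a factor of `NUM_SALTS`, so in practice the salt is often applied only to the keys known to be skewed.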