Hi @payalbhatia,
Empty Shuffle Partitions Due to Data Skew
- Introduce a random “salt” value to the keys to distribute the data more evenly across partitions.
- Implement a custom partitioner that distributes the data more evenly based on your specific data distribution.
- Broadcast the smaller dataset in a join so the skewed side is never shuffled at all.
- Analyze a sample of your data to understand the distribution and adjust your partitioning strategy accordingly.
Managing Large Shuffle Partitions
If your shuffle partition size is set to 128 MB but you have a key partition size of 700 MB, you might face performance issues. Here are some ways to handle this:
- Increase the number of shuffle partitions to reduce the size of each partition. You can do this by setting `spark.sql.shuffle.partitions` to a higher value.
- Enable Adaptive Query Execution (AQE) in Spark 3.x, which can dynamically coalesce or split shuffle partitions based on runtime statistics.
- Explicitly repartition your data before the shuffle operation to ensure a more balanced distribution.
- Ensure that the input file sizes are optimized to match the shuffle partition size, reducing the need for large partitions.
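The settings above can be sketched as follows. This is a configuration fragment, not a complete job: the partition count (`1600`) and app name are placeholder values you would tune to your cluster, though the config property names themselves are real Spark 3.x settings.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("skew-tuning-sketch").getOrCreate()

# More (hence smaller) shuffle partitions; 700 MB / 128 MB per key partition
# suggests raising this well above the default of 200.
spark.conf.set("spark.sql.shuffle.partitions", "1600")

# AQE: coalesce small partitions and split skewed ones at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```

With AQE's skew-join handling enabled, Spark can split an oversized 700 MB partition into several advisory-sized chunks on its own, which often removes the need for manual salting.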
Would you like more detailed guidance on any of these strategies?