Low Shuffle Merge in Databricks is a feature that optimizes the way data is merged when using Delta Lake, reducing the amount of data shuffled between nodes.
- Traditional merges can involve heavy data shuffling, as data is redistributed across the cluster to ensure correct merging.
- With Low Shuffle Merge, rows that the merge does not modify are processed separately from the modified rows, so only a subset of the data is shuffled. This improves performance and reduces the cost of merge operations.
Below are the benefits of Low Shuffle Merge:
1. Faster Execution: Reduces the amount of data shuffled, leading to faster merge operations.
2. Cost Efficiency: Less shuffling means lower resource consumption (CPU, memory), reducing overall cloud costs.
3. Scalability: Improves the performance of merges on large datasets, enabling better scalability.
4. Better Cluster Utilization: Reduces network traffic and improves resource utilization on the cluster.
This feature is particularly useful in large-scale data processing scenarios where frequent merges are necessary, such as updating or deleting records in Delta tables.
To enable this feature, set the following Spark configuration:
spark.databricks.delta.merge.enableLowShuffle = true
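As a rough illustration of how the setting above might be applied from a notebook, here is a minimal Python sketch. It assumes a `spark` session object like the one Databricks notebooks provide; the helper names (`enable_low_shuffle_merge`, `build_merge_sql`, `run_upsert`) and the table and column names in the usage note are hypothetical and exist only for this example.

```python
# Configuration key taken from the setting shown above.
LOW_SHUFFLE_MERGE_CONF = "spark.databricks.delta.merge.enableLowShuffle"


def enable_low_shuffle_merge(spark):
    """Turn on Low Shuffle Merge for the given Spark session."""
    spark.conf.set(LOW_SHUFFLE_MERGE_CONF, "true")


def build_merge_sql(target, source, key):
    """Build a simple upsert MERGE statement.

    The table and key names are supplied by the caller; this helper is
    hypothetical and only covers the basic update-or-insert pattern.
    """
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source} AS s "
        f"ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def run_upsert(spark, target, source, key):
    """Enable the setting for this session, then run the MERGE."""
    enable_low_shuffle_merge(spark)
    return spark.sql(build_merge_sql(target, source, key))
```

In a notebook this could be called as `run_upsert(spark, "customers", "customer_updates", "customer_id")`, where both tables are assumed to exist and the target is a Delta table. The setting is applied at the session level here, so it affects later merges in the same session as well.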
https://docs.databricks.com/en/optimizations/low-shuffle-merge.html