Databricks Community

Sourav-Kundu · 10-20-2024

Low Shuffle Merge is a Databricks feature that optimizes Delta Lake MERGE operations by reducing the amount of data shuffled between nodes.

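- Traditional merges can involve heavy data shuffling: every row in the files touched by the merge, modified or not, is redistributed across the cluster to produce the rewritten files.

- With Low Shuffle Merge, unmodified rows in those files are handled separately from the modified rows, so only a subset of the data is shuffled. This improves performance and reduces the cost of merge operations.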

The benefits of Low Shuffle Merge are:

1. Faster Execution: Reduces the amount of data shuffled, leading to faster merge operations.

2. Cost Efficiency: Less shuffling means lower resource consumption (CPU, memory), reducing overall cloud costs.

3. Scalability: Improves the performance of merges on large datasets, enabling better scalability.

4. Better Cluster Utilization: Reduces network traffic and improves resource utilization on the cluster.

This feature is particularly useful in large-scale data processing scenarios where frequent merges are necessary, such as updating or deleting records in Delta tables.

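To enable the feature where it is not already on by default (recent Databricks Runtime versions enable it automatically; see the documentation linked below), set the following Spark configuration:
spark.databricks.delta.merge.enableLowShuffle = true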
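
For a notebook or job, one way to apply this setting at the session level is through `spark.conf.set`. The sketch below assumes an existing `SparkSession` is passed in; the helper name `enable_low_shuffle_merge` is hypothetical, and only the configuration key comes from the post.

```python
LOW_SHUFFLE_MERGE_KEY = "spark.databricks.delta.merge.enableLowShuffle"


def enable_low_shuffle_merge(spark) -> str:
    """Enable Low Shuffle Merge for the given session and return the stored value.

    `spark` is assumed to be an active SparkSession (predefined in Databricks
    notebooks); this helper is an illustration, not part of any library.
    """
    spark.conf.set(LOW_SHUFFLE_MERGE_KEY, "true")
    return spark.conf.get(LOW_SHUFFLE_MERGE_KEY)
```

In a notebook this would be called as `enable_low_shuffle_merge(spark)` before running the merge. The setting can also be placed in the cluster's Spark configuration so that it applies to every session on that cluster.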

https://docs.databricks.com/en/optimizations/low-shuffle-merge.html

Advika · 10-28-2024

Great post, @Sourav-Kundu. The benefits you've outlined, especially regarding faster execution and cost efficiency, are valuable for anyone working with large-scale data processing. Thanks for sharing!

