Hi @dasiekr, please refer to the content below; it might help you.
MERGE: Under the hood
Delta Lake completes a MERGE in two steps.
- Perform an inner join between the target table and the source table to select all the files that contain matching rows.
- Perform an outer join between the selected files in the target and source tables and write out the updated/deleted/inserted data.
The main way this differs from an UPDATE or a DELETE under the hood is that Delta Lake uses joins to complete a MERGE. This allows us to use some unique strategies to improve performance.
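For reference, here is a minimal sketch of a MERGE using the Delta Lake Python API; the table paths and key column are hypothetical placeholders, and `spark` is assumed to be an active SparkSession.

```python
from delta.tables import DeltaTable

# Hypothetical paths and key column; adjust to your own tables.
target = DeltaTable.forPath(spark, "/mnt/delta/events")
source = spark.read.format("delta").load("/mnt/delta/events_updates")

(target.alias("t")
    .merge(source.alias("s"), "t.event_id = s.event_id")  # step 1: inner join finds the files to rewrite
    .whenMatchedUpdateAll()                                # matched rows are updated
    .whenNotMatchedInsertAll()                             # unmatched source rows are inserted
    .execute())                                            # step 2: outer join writes out the new files
```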
MERGE: Performance tuning tips
To improve performance of the MERGE command, you need to determine which of the two joins that make up the merge is limiting your speed.
If the inner join is the bottleneck (i.e., finding the files that Delta Lake needs to rewrite takes too long), try the following strategies (a sketch of these settings follows the list):
- Add more predicates to narrow down the search space.
- Adjust shuffle partitions.
- Adjust broadcast join thresholds.
- Compact the small files in the table if there are lots of them, but don't compact them into files that are too large, since Delta Lake has to copy the entire file to rewrite it.
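Here is a minimal sketch of these knobs in PySpark, reusing the `target` and `source` handles from the sketch above; the partition column (`date`), the predicate, and the numeric values are illustrative assumptions you would tune for your own workload.

```python
# 1. Narrow the search space: add a partition-pruning predicate to the merge
#    condition (assumes a hypothetical "date" partition column on the target).
(target.alias("t")
    .merge(source.alias("s"),
           "t.event_id = s.event_id AND t.date >= '2023-01-01'")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# 2. Adjust shuffle partitions (Spark's default is 200).
spark.conf.set("spark.sql.shuffle.partitions", "400")

# 3. Raise the broadcast join threshold (in bytes) so a small source table
#    can be broadcast; 100 MB here is just an example value.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))

# 4. Compact the small files in the target table before merging.
spark.sql("OPTIMIZE delta.`/mnt/delta/events`")
```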
On the other hand, if the outer join is the bottleneck (i.e., rewriting the actual files themselves takes too long), try the strategies below (a configuration sketch follows the list):
- Adjust shuffle partitions.
  - This can generate too many small files for partitioned tables.
  - Reduce the number of files by enabling automatic repartitioning before writes (with Optimized Writes in Databricks Delta Lake).
- Adjust broadcast thresholds. If you're doing a full outer join, Spark cannot do a broadcast join, but if you're doing a right outer join, Spark can do one, and you can adjust the broadcast thresholds as needed.
- Cache the source table / DataFrame.
  - Caching the source table can speed up the second scan, but be sure not to cache the target table, as this can lead to cache coherency issues.
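And a sketch of the outer-join-side knobs, again reusing the `source` DataFrame from above; the Optimized Writes setting shown is Databricks-specific, and the values are placeholders.

```python
# Adjust shuffle partitions for the write-out (outer join) phase.
spark.conf.set("spark.sql.shuffle.partitions", "400")

# Enable Optimized Writes (Databricks-specific) so data is automatically
# repartitioned before writing, reducing small files in partitioned tables.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

# Cache the source DataFrame (never the target Delta table) so the second
# scan during the outer join reads from memory.
source.cache()
source.count()  # materialize the cache before running the MERGE
```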
Ajay Kumar Pandey