Hi @Brad,
Could you try applying some standard optimization practices and check the outcome:
1. If your runtime is 15.2 or above, could you enable liquid clustering on the source and target tables using the JOIN columns?
ALTER TABLE <table_name>
CLUSTER BY (<clustering_columns>)
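As a minimal sketch (the table and column names below are placeholders, not taken from your job), the clustering setup could look like this; note that clustering applies incrementally, so running OPTIMIZE afterwards clusters the existing data:

```sql
-- Enable liquid clustering on the join key (placeholder names).
ALTER TABLE source_table
CLUSTER BY (join_key);

ALTER TABLE target_table
CLUSTER BY (join_key);

-- Trigger clustering of the data already in the tables.
OPTIMIZE source_table;
OPTIMIZE target_table;
```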
2. If your runtime is below 15.2, could you run OPTIMIZE on the source and target tables with ZORDER BY on the JOIN columns?
OPTIMIZE <table_name>
ZORDER BY (<column1>, <column2>)
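For example, assuming a hypothetical join key and date column (not taken from your pipeline), the compaction step would be:

```sql
-- Compact small files and co-locate rows on the join columns
-- (table and column names are illustrative placeholders).
OPTIMIZE source_table
ZORDER BY (join_key, event_date);

OPTIMIZE target_table
ZORDER BY (join_key, event_date);
```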
What I believe is happening: the problem is in the source table. It is much bigger than the target, unpartitioned, unoptimized, and made up of many small files. Even with the checkpointing mechanism, all those files need to be scanned to select the incremental data. If your source is 100 GB and the average file is 100 KB, that is far smaller than the recommended ~100 MB per file.
At that file size you have roughly one million files, and all the metadata for them (file paths, sizes, locations) is held in memory on the driver node; the driver has to track every one of those files.
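You can check whether this is actually the case with DESCRIBE DETAIL, which reports the file count and total size for a Delta table (table name is a placeholder):

```sql
-- numFiles and sizeInBytes show how fragmented the table is;
-- sizeInBytes / numFiles gives the average file size.
DESCRIBE DETAIL source_table;
```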
In summary, based on the symptoms: if you cannot use liquid clustering, I would first run OPTIMIZE with ZORDER on both the source and target tables and check whether that solves the issue. If it does, your problem is caused by the number of small files.