
Will MERGE incur a lot of driver memory?

Brad
Contributor II

Hi team,

We have a job that runs MERGE on a target table with around 220 million rows. We found it needs a lot of driver memory (just for the MERGE itself). From the job metrics we can see the MERGE needs at least 46 GB of driver memory. Is there something about how MERGE works that makes it use so much driver memory?
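For context, the statement is essentially of the following shape (the table, alias, and key names here are just placeholders, not our actual schema):

MERGE INTO target_table AS t
USING incremental_source AS s
  ON t.id = s.id                     -- join on the key column(s)
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *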

Thanks

3 REPLIES

filipniziol
Contributor III

Hi @Brad,

In general, the MERGE statement can be memory-intensive.

To optimize it, check this article:
https://medium.com/@abhinav.prakash1804/databricks-a-comprehensive-optimization-guide-7b803289c5a7

If you need additional help, could you share the following:
1. Are you partitioning/clustering the data by the columns used in the join condition? If yes, how many partitions do you have?

2. What is the average file size of the source and target (run DESCRIBE DETAIL on both tables; see the example after this list)?

3. When merging into the target of 220 million rows, how many records are in the source?
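For question 2, something along these lines gives you the numbers (the table name is a placeholder); the average file size is roughly sizeInBytes divided by numFiles:

DESCRIBE DETAIL <table_name>
-- check the numFiles and sizeInBytes columns of the result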

Brad
Contributor II

Thanks for the response.

1. The target table is around 14 GB. Based on Databricks best practice, we don't add partitions to tables smaller than 1 TB.

2. Both source and target are Delta tables. We enabled spark.databricks.delta.optimizeWrite.enabled (a rough sketch of the setting is after point 3), so the target table's file size is around 128 MB. The source table is 102 GB across around 250 million rows. However, the source side of the MERGE is an incremental Delta read from that source table. The read is checkpoint-based, so I think the source table read is probably not the issue (although the source table does have a small-file problem, with files ranging from 10 KB to dozens of MB).

3. We want to understand what MERGE has to do behind the scenes that leads to such high driver memory usage.
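For reference, the optimized-write setting mentioned in point 2 is applied roughly like this on our side (the table name is a placeholder):

-- session-level setting
SET spark.databricks.delta.optimizeWrite.enabled = true;

-- or as a table property on the target table
ALTER TABLE target_table SET TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true');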

filipniziol
Contributor III

Hi @Brad,

Could you try applying these standard optimization practices and check the outcome:

1. If your runtime is 15.2 or above, could you enable liquid clustering on the source and target tables using the join columns?

ALTER TABLE <table_name>
CLUSTER BY (<clustering_columns>)

2. If your runtime is below 15.2, could you run OPTIMIZE on the source and target tables with ZORDER BY on the join columns?

OPTIMIZE <table_name>
ZORDER BY (<column1>, <column2>)
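As a concrete sketch, assuming hypothetical tables src and tgt joined on customer_id and order_date (your actual names will differ), and noting that existing data is only re-clustered once OPTIMIZE runs after the ALTER:

-- runtime >= 15.2: enable liquid clustering, then re-cluster existing data
ALTER TABLE tgt CLUSTER BY (customer_id, order_date);
OPTIMIZE tgt;

-- runtime < 15.2: compact small files and Z-order by the join columns
OPTIMIZE src ZORDER BY (customer_id, order_date);

Running DESCRIBE DETAIL before and after will show whether numFiles has dropped.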

What I believe is happening is that the problem is in the source table: it is much bigger than the target, not partitioned, not optimized, and full of small files. Even with the checkpointing mechanism, all of those files need to be scanned to select the incremental data. If your source is 100 GB and the average file is 100 KB, that is far below the recommended ~100 MB file size.

That would be about 1 million files, and all of the metadata for those files (file paths, sizes, locations) is held in memory on the driver node, since the driver has to track every one of them.

In summary, based on the symptoms: if you cannot use liquid clustering, I would first run OPTIMIZE with ZORDER on both the source and target and check whether that solves the issue. If it does, then your problem is caused by the number of small files.
