Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

"MERGE" always slower than "CREATE OR REPLACE"

New Contributor III


To update our Data Warehouse tables, we have tried two methods: "CREATE OR REPLACE" and "MERGE". With every query we've tried, "MERGE" is slower.

My question is this:

Has anyone successfully gotten a "MERGE" to perform faster than a "CREATE OR REPLACE"?

More Details

Expected Behavior

A MERGE is normally expected to only write changes (of which there are few), and be faster, while a CREATE is expected to write all records in the query (of which there are many), and be slower. Snowflake, for example, handles MERGE statements in a fraction of the time it takes for a CREATE.

Observed Behavior

MERGE is consistently slower.

  • This behavior is consistent across table size (small, medium, and large datasets all exhibit it)
  • It is consistent across query complexity (the simplest queries exhibit it as much as the most complicated ones)

Some details about our testing:

  • We use dbt (dbt-databricks) to compile and execute our SQL. See compiled query boilerplate below.
  • CREATE OR REPLACE statements look like this:
CREATE OR REPLACE TABLE <table name>
AS <query>
  • MERGE statements look like this:
MERGE INTO <table name>
USING <query as view> 
   ON <primary key equality> 
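For context, the statement compiled by dbt-databricks' default merge incremental strategy typically also carries the update/insert clauses. A sketch of that shape (table, view, and column names below are placeholders, not our actual schema):

```sql
-- Rough shape of a dbt-databricks compiled merge
-- (names are hypothetical placeholders)
MERGE INTO analytics.dim_customer AS target
USING dim_customer__dbt_tmp AS source
   ON target.customer_id = source.customer_id   -- primary key equality
WHEN MATCHED THEN UPDATE SET *                  -- rewrite matched rows
WHEN NOT MATCHED THEN INSERT *                  -- append new rows
```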

Other things we've tried:

  • Adjusting partitions and partition keys

The Question

We would love any help folks have. More than anything, we want to know: has anyone using Databricks experienced a MERGE being faster? Is this something we can realistically achieve on this platform?


New Contributor III

How much data are you dealing with, if you're willing to rewrite all the data with a replace? I think once the row set gets very large, you will start to see the benefits of merge come into play. Merge has some overhead at the beginning of the batch that drop-and-replace does not, so the larger the data you're dealing with, the sooner merge pays off. These are just my thoughts and in no way makes them correct.

New Contributor III

Great question. We're doing this on tables with hundreds of millions of rows.

Esteemed Contributor III

Kinda depends on how many partitions/files have to be rewritten.

Also, if you can add file pruning, you will see a huge performance gain.

I have cases where the merge is faster than a rebuild, but I also have many cases where I don't even use delta lake because a classic parquet overwrite is way faster.
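To illustrate the file-pruning idea mentioned above: adding a partition or date predicate to the merge condition lets Delta skip files that cannot contain matching rows, instead of scanning the whole target. A sketch, assuming a target table partitioned by an `event_date` column (all names here are hypothetical):

```sql
-- Without the extra predicate, MERGE must consider every file of the target.
-- Constraining ON to recent partitions allows partition/file pruning.
MERGE INTO events AS t
USING updates AS s
   ON t.event_id = s.event_id
  AND t.event_date >= current_date() - INTERVAL 7 DAYS  -- prunes older partitions
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
```

This only helps, of course, when the incoming changes are known to fall within the pruned range.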

New Contributor II

Same issue here. No matter what tests I have done, there is no single case in which merging a few records into a Delta table is faster than rewriting the entire table as a Parquet table.

Any light on this?

New Contributor III

Hi @Graham 
Can you please try Low Shuffle Merge (LSM) and see if it helps? LSM is a newer MERGE algorithm that aims to maintain the existing data organization (including z-order clustering) for unmodified data while simultaneously improving performance.

LSM provides better performance by processing unmodified rows in a separate, more streamlined processing mode, instead of processing them together with the modified rows. As a result, the amount of shuffled data is reduced significantly, leading to improved performance. LSM also removes the need for users to re-run the OPTIMIZE ZORDER BY command after performing a MERGE operation.
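If your runtime does not already enable it by default (it became the default in newer Databricks Runtime versions), LSM can be switched on per session. A sketch, assuming the configuration key as documented by Databricks:

```sql
-- Enable Low Shuffle Merge for the current session
-- (key name per Databricks docs; default in newer runtimes)
SET spark.databricks.delta.merge.enableLowShuffle = true;
```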
