"MERGE" always slower than "CREATE OR REPLACE"

Graham
New Contributor III

Overview

To update our Data Warehouse tables, we have tried two methods: "CREATE OR REPLACE" and "MERGE". With every query we've tried, "MERGE" is slower.

My question is this:

Has anyone successfully gotten a "MERGE" to perform faster than a "CREATE OR REPLACE"?

More Details

Expected Behavior

A MERGE is normally expected to write only the changed records (of which there are few) and therefore be faster, while a CREATE OR REPLACE is expected to write every record returned by the query (of which there are many) and be slower. Snowflake, for example, handles MERGE statements in a fraction of the time a CREATE takes.

Observed Behavior

MERGEs are slower.

  • This behavior is consistent across table size (small, medium, and large datasets all exhibit it)
  • It is consistent across query complexity (the simplest queries exhibit it as much as the most complicated ones do)

Some details about our testing:

  • We use dbt (dbt-databricks) to compile and execute our SQL. See the compiled query boilerplate below, and a sketch of the model config after this list.
  • CREATE OR REPLACE statements look like this:
CREATE OR REPLACE TABLE <table name> 
AS <query>
  • MERGE statements look like this:
MERGE INTO <table name>
USING <query as view> 
   ON <primary key equality> 
WHEN MATCHED THEN UPDATE SET * 
WHEN NOT MATCHED THEN INSERT *
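
For context, the dbt side of this is a plain incremental model. A minimal sketch of the config that compiles to a MERGE like the one above (the model and column names here are hypothetical stand-ins for ours):

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id'
) }}

-- order_id and stg_orders are stand-ins for our real primary key and staging model
select * from {{ ref('stg_orders') }}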

Other things we've tried:

  • Adjusting partitions and partition keys (see the sketch below)
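
As a rough illustration, this is the kind of repartitioning we experimented with (table and column names are hypothetical):

CREATE OR REPLACE TABLE warehouse.orders
PARTITIONED BY (order_date)
AS SELECT * FROM staging.orders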

The Question

We would love any help folks have. More than anything, we want to know: has anyone using Databricks experienced a MERGE being faster? Is this something we can realistically achieve on this platform?

5 REPLIES

byrdman
New Contributor III

How much data are you dealing with, if you're willing to rewrite all the data with a replace? I think once the data gets into very large row sets, you will start seeing the merge benefits come into play. The merge has some overhead at the beginning of the batch that drop-and-replace does not, so the larger the data you're dealing with, the sooner the benefits of merge will show up. These are just my thoughts on it and by no means make it correct.

Graham
New Contributor III

Great question. We're doing this on tables with hundreds of millions of rows.

-werners-
Esteemed Contributor III

Kinda depends on how many partitions/files have to be rewritten.

Also, if you can add file pruning, you will see a huge performance gain:

https://learn.microsoft.com/en-us/azure/databricks/optimizations/
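
As a minimal sketch of what pruning on a merge can look like, you can add a predicate on the partition (or z-order) column to the merge condition. This assumes the target is laid out on a date column and that the incoming updates only ever contain recent rows; all names here are hypothetical:

MERGE INTO warehouse.orders AS t
USING updates AS s
   ON t.order_id = s.order_id
  AND t.order_date >= current_date() - INTERVAL 7 DAYS  -- lets Delta skip files outside the window
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

-- Caveat: this is only safe if the source never carries rows older than the window;
-- otherwise older matches would be misclassified as NOT MATCHED and re-inserted.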

I have cases where the merge is faster than a rebuild, but I also have many cases where I don't even use delta lake because a classic parquet overwrite is way faster.

Gazbriel
New Contributor II

Same issue here. No matter what tests I have run, there is not a single case in which merging a few records into a Delta table goes faster than writing the entire table out as a Parquet table.

Any light on this?

Manisha_Jena
New Contributor III

Hi @Graham,

Can you please try Low Shuffle Merge (LSM) and see if it helps? LSM is a newer MERGE algorithm that aims to maintain the existing data organization (including z-order clustering) for unmodified data while simultaneously improving performance.

LSM provides better performance by processing unmodified rows in a separate, more streamlined processing mode instead of processing them together with the modified rows. As a result, the amount of shuffled data is reduced significantly, leading to improved performance. LSM also removes the need to re-run the OPTIMIZE ZORDER BY command after performing a MERGE operation.
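
For what it's worth, on Databricks Runtime versions where low shuffle merge is not already the default, I believe it can be enabled with a session configuration like the one below; please verify against the docs for your runtime version:

SET spark.databricks.delta.merge.enableLowShuffle = true;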
