"MERGE" always slower than "CREATE OR REPLACE"

Graham
New Contributor III

Overview

To update our Data Warehouse tables, we have tried two methods: "CREATE OR REPLACE" and "MERGE". With every query we've tried, "MERGE" is slower.

My question is this:

Has anyone successfully gotten a "MERGE" to perform faster than a "CREATE OR REPLACE"?

More Details

Expected Behavior

A MERGE is normally expected to write only the changed records (of which there are few) and therefore be faster, while a CREATE OR REPLACE is expected to write every record returned by the query (of which there are many) and be slower. Snowflake, for example, handles MERGE statements in a fraction of the time a CREATE takes.

Observed Behavior

MERGEs are slower.

  • This behavior is consistent across table size (small, medium, and large datasets all exhibit it)
  • It is consistent across query complexity (the simplest queries exhibit it as much as the most complicated ones do)

Some details about our testing:

  • We use dbt (dbt-databricks) to compile and execute our SQL. See the compiled query boilerplate below, and a sketch of the model config after this list.
  • CREATE OR REPLACE statements look like this:
CREATE OR REPLACE TABLE <table name> 
AS <query>
  • MERGE statements look like this:
MERGE INTO <table name>
USING <query as view> 
   ON <primary key equality> 
WHEN MATCHED THEN UPDATE SET * 
WHEN NOT MATCHED THEN INSERT *
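
For context, the dbt side of this is a plain incremental model. A minimal sketch of the config that compiles to a MERGE like the one above (the model and column names here are hypothetical stand-ins for ours):

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id'
) }}

-- order_id and stg_orders are stand-ins for our real primary key and staging model
select * from {{ ref('stg_orders') }}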

Other things we've tried:

  • Adjusting partitions and partition keys (see the sketch below)
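
As a rough illustration, this is the kind of repartitioning we experimented with (table and column names are hypothetical):

CREATE OR REPLACE TABLE warehouse.orders
PARTITIONED BY (order_date)
AS SELECT * FROM staging.orders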

The Question

We would love any help folks have. More than anything, we want to know: has anyone using Databricks experienced a MERGE being faster? Is this something we can realistically achieve on this platform?

5 REPLIES

byrdman
New Contributor III

How much data are you dealing with, if you're willing to rewrite all the data with a replace? I think once the data gets into very large row sets, you will start seeing the merge benefits come into play. The merge has some overhead at the beginning of the batch that drop-and-replace does not, so the larger the data you're dealing with, the sooner the benefits of merge will show up. These are just my thoughts on it and by no means make it correct.

Graham
New Contributor III

Great question. We're doing this on tables with hundreds of millions of rows.

-werners-
Esteemed Contributor III

Kinda depends on how many partitions/files have to be rewritten.

Also, if you can add file pruning, you will see a huge performance gain:

https://learn.microsoft.com/en-us/azure/databricks/optimizations/
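
As a minimal sketch of what pruning on a merge can look like, you can add a predicate on the partition (or z-order) column to the merge condition. This assumes the target is laid out on a date column and that the incoming updates only ever contain recent rows; all names here are hypothetical:

MERGE INTO warehouse.orders AS t
USING updates AS s
   ON t.order_id = s.order_id
  AND t.order_date >= current_date() - INTERVAL 7 DAYS  -- lets Delta skip files outside the window
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

-- Caveat: this is only safe if the source never carries rows older than the window;
-- otherwise older matches would be misclassified as NOT MATCHED and re-inserted.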

I have cases where the merge is faster than a rebuild, but I also have many cases where I don't even use delta lake because a classic parquet overwrite is way faster.

Gazbriel
New Contributor II

Same issue here. No matter what tests I have run, there is not a single case in which merging a few records into a Delta table goes faster than writing the entire table out as a Parquet table.

Any light on this?

Manisha_Jena
New Contributor III

Hi @Graham,

Can you please try Low Shuffle Merge (LSM) and see if it helps? LSM is a newer MERGE algorithm that aims to maintain the existing data organization (including z-order clustering) for unmodified data while simultaneously improving performance.

LSM provides better performance by processing unmodified rows in a separate, more streamlined processing mode instead of processing them together with the modified rows. As a result, the amount of shuffled data is reduced significantly, leading to improved performance. LSM also removes the need to re-run the OPTIMIZE ZORDER BY command after performing a MERGE operation.
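
For what it's worth, on Databricks Runtime versions where low shuffle merge is not already the default, I believe it can be enabled with a session configuration like the one below; please verify against the docs for your runtime version:

SET spark.databricks.delta.merge.enableLowShuffle = true;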
