09-16-2022 10:41 AM
Overview
To update our Data Warehouse tables, we have tried two methods: "CREATE OR REPLACE" and "MERGE". With every query we've tried, "MERGE" is slower.
My question is this:
Has anyone successfully gotten a "MERGE" to perform faster than a "CREATE OR REPLACE"?
More Details
Expected Behavior
A MERGE is normally expected to only write changes (of which there are few), and be faster, while a CREATE is expected to write all records in the query (of which there are many), and be slower. Snowflake, for example, handles MERGE statements in a fraction of the time it takes for a CREATE.
Observed Behavior
MERGES are slower.
Some details about our testing:
Create-based refresh:
CREATE OR REPLACE TABLE <table name>
AS <query>

Merge-based refresh:
MERGE INTO <table name>
USING <query as view>
ON <primary key equality>
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
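As a concrete sketch of the two templates above (the table, schema, and column names here are made up purely for illustration; assume a target table dw.orders keyed on order_id):

```sql
-- Full rebuild: rewrites every row of the target on each run
CREATE OR REPLACE TABLE dw.orders AS
SELECT order_id, customer_id, status, updated_at
FROM staging.orders_latest;

-- Incremental upsert: in principle should only rewrite files
-- containing changed or new rows
MERGE INTO dw.orders AS t
USING staging.orders_latest AS s
  ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```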
Other things we've tried:
The Question
We would love any help folks have. More than anything, we want to know: has anyone using Databricks experienced a MERGE being faster? Is this something we can realistically achieve on this platform?
09-18-2022 10:44 AM
How much data are you dealing with, if you're willing to rewrite all of it with a replace? I think once the data gets into a very large row set, you will start seeing the merge benefits come into play. The merge has some overhead at the start of the batch that drop-and-replace does not, so the larger the data you're dealing with, the sooner the benefits of merge will show up. This is just my take on it and in no way makes it correct.
09-19-2022 08:29 AM
Great question. We're doing this on tables with hundreds of millions of rows.
09-20-2022 12:54 AM
Kinda depends on how many partitions/files have to be rewritten.
Also, if you can add file pruning, you will see a huge performance gain:
https://learn.microsoft.com/en-us/azure/databricks/optimizations/
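One common form of file pruning is adding a static predicate to the MERGE condition so Delta can skip files outside the window. This is a sketch only: the date column and the seven-day window are made-up assumptions, and it is only safe if updates can never touch rows older than the window:

```sql
MERGE INTO dw.orders AS t
USING staging.orders_latest AS s
  ON t.order_id = s.order_id
 -- static predicate lets Delta skip files/partitions outside the window
 AND t.order_date >= date_sub(current_date(), 7)
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```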
I have cases where the merge is faster than a rebuild, but I also have many cases where I don't even use delta lake because a classic parquet overwrite is way faster.
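For reference, the "classic parquet overwrite" mentioned above might look like this in Spark SQL (table and view names are hypothetical; USING PARQUET makes it a plain Parquet table rather than Delta):

```sql
-- one-time create of a plain (non-Delta) Parquet table
CREATE TABLE IF NOT EXISTS staging.orders_snapshot
USING PARQUET
AS SELECT * FROM source_view WHERE 1 = 0;

-- full rewrite on each run
INSERT OVERWRITE TABLE staging.orders_snapshot
SELECT * FROM source_view;
```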
10-08-2023 03:23 AM
Same issue here. In every test I have done, there is no single case in which merging a few records into a Delta table goes faster than writing the entire table out as a Parquet table.
Any light on this?
11-02-2023 02:18 AM
Hi @Graham
Can you please try Low Shuffle Merge [LSM] and see if it helps? LSM is a new MERGE algorithm that aims to maintain the existing data organization (including z-order clustering) for unmodified data, while simultaneously improving performance.
LSM provides better performance by processing unmodified rows in a separate, more streamlined processing mode, instead of processing them together with the modified rows. As a result, the amount of shuffled data is reduced significantly, leading to improved performance. LSM also removes the need for users to re-run the OPTIMIZE ZORDER BY command after performing a MERGE operation.
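On older Databricks runtimes, LSM had to be enabled explicitly via a Spark conf (on recent runtimes it is on by default, in which case this is a no-op). The conf name below is from memory and worth double-checking against the docs:

```sql
-- enable Low Shuffle Merge for the session (assumed conf name)
SET spark.databricks.delta.merge.enableLowShuffle = true;
```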