Overview
To update our Data Warehouse tables, we have tried two methods: "CREATE OR REPLACE" and "MERGE". With every query we've tried, "MERGE" is slower.
My question is this:
Has anyone successfully gotten a "MERGE" to perform faster than a "CREATE OR REPLACE"?
More Details
Expected Behavior
A MERGE is normally expected to only write changes (of which there are few), and be faster, while a CREATE is expected to write all records in the query (of which there are many), and be slower. Snowflake, for example, handles MERGE statements in a fraction of the time it takes for a CREATE.
Observed Behavior
MERGES are slower.
- This behavior is consistent across table size (small, large, medium datasets all exhibit this)
- It is consistent across query complexity (the simplest queries exhibit this as much as do the most complicated)
Some details about our testing:
- We use dbt (dbt-databricks) to compile and execute our SQL. See compiled query boilerplate below.
- CREATE OR REPLACE statements look like this:
CREATE OR REPLACE TABLE <table name>
AS <query>
- MERGE statements look like this:
MERGE INTO <table name>
USING <query as view>
ON <primary key equality>
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
Other things we've tried:
- Adjusting partitions and partition keys
The Question
We would love any help folks have. More than anything, we want to know: has anyone using Databricks experienced a MERGE being faster? Is this something we can realistically achieve on this platform?