cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Delta Upsert performance on empty table

pantelis_mare
Contributor III

Hello all,

I was just wandering, performance wise how does it compare a plain write operation with a merge operation on an EMPTY delta table. Do we really risk to get significant performance drop?

The use case would be to have the same pipeline for initial and incremental load.

Thank you in advance!

1 ACCEPTED SOLUTION

Accepted Solutions

-werners-
Esteemed Contributor III

I haven't tested this but I think the merge will be slower. A merge will check the merge conditions which will take extra time compared to the blind append.

So unless delta lake has a check built in to check on empty tables, my guess is it will be slower.

By how much? No idea, I don't think it will be by a lot. But a blind append is of course wicked fast.

If you want a single script for init/incremental, you can do a check on a count of records in the target table. If that is 0, pass a different query.

I like to keep them separated though.

View solution in original post

7 REPLIES 7

-werners-
Esteemed Contributor III

I haven't tested this but I think the merge will be slower. A merge will check the merge conditions which will take extra time compared to the blind append.

So unless delta lake has a check built in to check on empty tables, my guess is it will be slower.

By how much? No idea, I don't think it will be by a lot. But a blind append is of course wicked fast.

If you want a single script for init/incremental, you can do a check on a count of records in the target table. If that is 0, pass a different query.

I like to keep them separated though.

pantelis_mare
Contributor III

Hello @Werner Stinckens​ !

Thanks for taking the time to answer. The question is rather how much slower would it be. My understanding is that during merge there is a first step of identifying what rows are to be insterted (simple append), thus if this step's duration is minimal (order of seconds) because of having an empty delta table, then it practically does not make any sense to increase our codebase.

If somebody has seen any tests/benchmarks I would appreciate!

-werners-
Esteemed Contributor III

In case no one has benchmarks/tests, you could try it yourself? Create an empty delta table and try with merge and append.

My guess is that it will be a matter of seconds, as you already mentioned.

I have not seen benchmarks for this but I would imagine that a merge would not be much slower than a pure insert because the merge would quickly identify that all rows would need to be inserted.

Thank you @Ryan Chynoweth (Databricks)​ . This is what I imagine as well. Will be doing a benchmark in the following days and will post the findings

Yes, please share your results on this post so that future users can get the answer too!

pantelis_mare
Contributor III

Hello @Kaniz Fatma​ ,

Unfortunately I did not do any further investigation on the subject. Given that the merge on an empty table will only be done once at the creation of a table, it wouldn't really matter to be honest.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group