Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Large MERGE Statements - 500+ lines of code!

StevenW
New Contributor III

I'm new to Databricks (not new to databases; I've been a DB developer for 10+ years).

How do you generate a MERGE statement in Databricks?

Manually maintaining a 500+ or 1,000+ line MERGE statement doesn't make much sense. I'm working with large tables of between 200 and 500 columns.


10 REPLIES

Hubert-Dudek
Esteemed Contributor III

In my opinion, when possible the MERGE should be on the primary key. If that's not possible, you can create your own unique key (by concatenating some fields and optionally hashing the result) and then use it in the merge condition.
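
A minimal sketch of that surrogate-key approach in PySpark, assuming an incoming DataFrame updates_df and hypothetical natural-key fields col_a and col_b:

from pyspark.sql import functions as F

# Build a deterministic surrogate merge key by concatenating the
# natural-key fields (col_a/col_b are placeholders) and hashing the result.
updates_df = updates_df.withColumn(
    "merge_key",
    F.sha2(F.concat_ws("||", F.col("col_a"), F.col("col_b")), 256),
)

The same expression can then be materialized on the target side, so both tables share a single-column key for the merge condition.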

StevenW
New Contributor III

Thanks... but that's not really what I'm asking.

How many columns can a MERGE statement manage, before maintenance becomes a nightmare?

Hubert-Dudek
Esteemed Contributor III

Delta was tested with up to 4,000 columns. Here are tests with up to 1,000 columns after the December 2021 update: https://github.com/delta-io/delta/pull/584

Also, remember that statistics are precalculated for the first 32 columns only (you can raise that limit in the table settings). So it's good to keep the fields used in your merge condition within those first 32 columns.
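
For reference, that limit is controlled by the delta.dataSkippingNumIndexedCols table property; a short sketch of raising it (my_table and 64 are placeholders):

# Collect statistics on the first 64 columns instead of the default 32.
spark.sql("""
    ALTER TABLE my_table
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '64')
""")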


StevenW
New Contributor III

Interesting... I had not yet considered the performance implications; I didn't think there would be any. I'll clearly need to come back to this one... 🙂

I'm currently only concerned with the maintenance of a piece of code (the 500+ lines of the MERGE statement). Do you just "eyeball" the changes and hope for the best, or is there a more structured way to maintain large MERGE statements?

-werners-
Esteemed Contributor III

I am still wondering what you mean by 500+ lines of code for a merge.

Do you mean the list of columns that should be updated?

If you want to update a subset of columns, that can indeed become cumbersome. But with some coding in Scala/Python you can build a list of column names and pass it into the query.

If you want to update all columns, use *:

MERGE INTO target
  USING updates
  ON target.key1 = updates.key1 AND
     target.key2 = updates.key2 AND
     target.key3 = updates.key3
  WHEN MATCHED THEN
     UPDATE SET *
  WHEN NOT MATCHED THEN
     INSERT *

StevenW
New Contributor III

Yes, I mean the list of columns becomes large, and maintaining such a MERGE statement could be very cumbersome.

Would you happen to have an example of such Python code? This does actually make sense: if such a list were generated dynamically, it could then be used for the UPDATE/INSERT clauses.

-werners-
Esteemed Contributor III

I don't have an example at hand, but if you load the table as a DataFrame, df.columns gives you all of its column names. Depending on the case you can drop columns or keep a few, and then pass that list into the merge.

TBH I have never done that though; I always use UPDATE SET * to avoid the hassle.
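
To make that concrete, here is a minimal sketch using the Delta Lake Python API, assuming a Delta table named target_table, a source view named updates, and key columns key1/key2/key3 (all placeholders):

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "target_table")  # placeholder target
updates_df = spark.table("updates")                 # placeholder source

key_cols = ["key1", "key2", "key3"]                 # placeholder keys
update_cols = [c for c in updates_df.columns if c not in key_cols]

# Build the join condition and the SET/VALUES maps from df.columns.
condition = " AND ".join(f"t.{k} = s.{k}" for k in key_cols)
set_map = {c: f"s.{c}" for c in update_cols}
insert_map = {c: f"s.{c}" for c in updates_df.columns}

(target.alias("t")
    .merge(updates_df.alias("s"), condition)
    .whenMatchedUpdate(set=set_map)
    .whenNotMatchedInsert(values=insert_map)
    .execute())

Because the column lists come straight from df.columns, nothing has to be hand-edited when the schema changes.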

StevenW
New Contributor III

Would UPDATE SET * not require the source and target columns to have the same names, and the same column order?

ACCEPTED SOLUTION

-werners-
Esteemed Contributor III

I don't know about order.

However, I always prepare the incoming data so that it has the same schema as the target; this makes merges easy. You really do not want to tinker around inside a merge statement, typing out tons of columns.

Using Scala/Python it is practically always possible to prepare your data.

It takes some time to learn, but it is worth it.
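
One way to do that preparation, continuing the placeholder names from the sketch above: select the incoming DataFrame down to the target's columns before merging.

# Align the incoming data to the target table's column names and order.
# (Assumes every target column exists in the source; any missing ones
# would first need to be added, e.g. with F.lit(None).)
target_cols = spark.table("target_table").columns
aligned_df = updates_df.select(*target_cols)

With the schemas aligned, UPDATE SET * and INSERT * work without listing a single column.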

StevenW
New Contributor III

Thanks, this does make sense.

I have a new lead to chase... 🙂

Much appreciated. 😊
