01-26-2023 06:20 AM
I'm new to Databricks (not new to databases; I've been a DB developer for 10+ years).
How do you generate a MERGE statement in Databricks?
Manually maintaining a MERGE statement that runs to 500+ or 1000+ lines doesn't make much sense. I'm working with large tables of between 200 and 500 columns.
01-26-2023 08:40 AM
In my opinion, the MERGE should be on the primary key whenever possible. If that isn't possible, you can create your own unique key (by concatenating some fields and optionally hashing them) and then use it in the merge logic.
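A minimal PySpark sketch of that surrogate-key idea (the column names and updates_df are made up for illustration):

from pyspark.sql.functions import concat_ws, sha2

# Build a deterministic surrogate key by concatenating candidate columns
# and hashing the result; key_parts is a hypothetical list of columns,
# and updates_df is assumed to be the incoming DataFrame.
key_parts = ["customer_id", "order_date", "line_no"]
updates_df = updates_df.withColumn(
    "merge_key", sha2(concat_ws("||", *key_parts), 256)
)
# Applying the same expression on the target side yields the key to merge on.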
01-26-2023 08:48 AM
Thanks, but that's not really what I'm asking...
How many columns can a MERGE statement manage before maintenance becomes a nightmare?
01-26-2023 08:54 AM
Delta Lake has been tested with up to 4000 columns; here are tests with up to 1000 columns after the December 2021 update: https://github.com/delta-io/delta/pull/584
Also, remember that statistics are precalculated only for the first 32 columns (you can raise that number in the table settings), so it is good to keep the fields used in your merge condition within those first 32 columns.
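If you do need stats on more columns, a hedged sketch of changing that setting (I believe the relevant Delta table property is delta.dataSkippingNumIndexedCols; the table name is a placeholder and spark is the notebook session):

# Raise the number of leading columns for which Delta collects file stats.
spark.sql(
    "ALTER TABLE target "
    "SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = 64)"
)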
01-27-2023 12:19 AM
Interesting... I had not yet considered the performance side; I didn't think there would be any issues. I will clearly need to come back to this one... 🙂
For now I'm only concerned with maintaining the code itself (the 500+ lines of the MERGE statement). Do you just "eyeball" the changes and hope for the best, or is there a more structured way to maintain large MERGE statements?
01-27-2023 05:31 AM
I am still wondering what you mean by 500+ lines of code for a merge.
Do you mean the list of columns that should be updated?
If you want to update only a subset of columns, that can indeed become cumbersome. But with some coding in Scala/Python you can create a list of column names which you can then pass to the query.
If you want to update all columns, use *:
-- $yearlist is a substituted list of years, used here to prune the target
MERGE INTO table
USING updates
ON table.RegistrationYear IN ($yearlist)
   AND table.key1 = updates.key1
   AND table.key2 = updates.key2
   AND table.key3 = updates.key3
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *
01-27-2023 05:57 AM
Yes, I mean the list of columns becomes large, and maintaining such a MERGE statement could be very cumbersome.
Would you happen to have an example of such Python code? That approach does make sense: if the column list were generated dynamically and then used for the UPDATE/INSERT clauses...
01-27-2023 06:01 AM
I don't have an example at hand, but df.columns gives you all the columns of the table (in DataFrame form, of course); depending on the case you can drop columns or keep a few, and then use that list for the merge.
TBH I never did that myself; I always use UPDATE SET * to avoid the hassle.
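That said, here is a minimal sketch of the idea using the Delta Lake Python merge builder (table names and key columns are placeholders, and spark is the session available in a Databricks notebook):

from delta.tables import DeltaTable

updates_df = spark.read.table("updates")        # incoming data (placeholder name)
target = DeltaTable.forName(spark, "target")    # Delta target table (placeholder name)

key_cols = ["key1", "key2", "key3"]             # assumed merge keys
data_cols = [c for c in updates_df.columns if c not in key_cols]

# Build the join condition and column maps from the schema instead of
# typing hundreds of column names by hand.
condition = " AND ".join(f"t.{k} = s.{k}" for k in key_cols)
update_map = {c: f"s.{c}" for c in data_cols}
insert_map = {c: f"s.{c}" for c in updates_df.columns}

(target.alias("t")
    .merge(updates_df.alias("s"), condition)
    .whenMatchedUpdate(set=update_map)
    .whenNotMatchedInsert(values=insert_map)
    .execute())

Because the maps are derived from df.columns, adding a column to the source only requires that the target schema keeps up; the merge code itself stays unchanged.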
01-27-2023 06:10 AM
Would UPDATE SET * not require the source and target columns to have the same names and the same column order?
01-27-2023 06:13 AM
I don't know about the order.
However, I always prepare the incoming data so that it has the same schema as the target. This makes merges easy; you really do not want to tinker around in a merge statement, typing out tons of columns.
Using Scala/Python it is practically always possible to prepare your data.
It takes some time to learn, but it is worth it.
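A minimal sketch of that preparation step (placeholder table names; spark is the notebook session and updates_df the incoming DataFrame):

from pyspark.sql.functions import col

# Project the incoming DataFrame onto the target's column list so that
# names and order line up before UPDATE SET * / INSERT *.
target_cols = spark.read.table("target").columns
aligned_updates = updates_df.select([col(c) for c in target_cols])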
01-27-2023 06:24 AM
Thanks, this does make sense.
I have a new lead to chase .. 🙂
Much appreciated. 😊