Large MERGE Statements - 500+ lines of code!

StevenW
New Contributor III

I'm new to Databricks. (Not new to DBs - a DB developer of 10+ years.)

How do you generate a MERGE statement in Databricks?

Manually maintaining 500+ or 1000+ lines in a MERGE statement doesn't make much sense. I'm working with large tables of between 200 and 500 columns.

Hubert-Dudek
Esteemed Contributor III

In my opinion, the MERGE statement should be on the primary key when possible. If that isn't possible, you can create your own unique key (by concatenating some fields and optionally hashing them) and then use it in the merge logic.
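
For example, something along these lines in PySpark (the field names customer_id/order_date and the key name merge_key are just placeholders):

# Build a synthetic merge key by concatenating business fields and
# hashing the result. The source fields here are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

updates_df = spark.table("updates")  # hypothetical incoming data

updates_df = updates_df.withColumn(
    "merge_key",
    F.sha2(F.concat_ws("||", F.col("customer_id"), F.col("order_date")), 256),
)
# The same expression applied on the target side gives a stable key to
# merge on: target.merge_key = updates.merge_key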

StevenW
New Contributor III

Thanks, but that's not really what I'm asking...

How many columns can a MERGE statement manage before maintenance becomes a nightmare?

Hubert-Dudek
Esteemed Contributor III

It was tested with up to 4000 columns. Here are tests with up to 1000 columns after the December 2021 update: https://github.com/delta-io/delta/pull/584

Also, remember that statistics are precalculated for the first 32 columns (you can raise that limit in the settings), so it is good to have the fields your merge condition uses among the first 32 columns.
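
The relevant setting is the delta.dataSkippingNumIndexedCols table property; for example (the table name and the value 50 are placeholders):

# Raise the number of leading columns for which Delta precomputes
# statistics. 'spark' is the session Databricks provides in notebooks.
spark.sql("""
    ALTER TABLE my_table
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '50')
""")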

StevenW
New Contributor III

Interesting... I had not yet considered the performance issues. I didn't think there would be any. I will clearly need to come back to this one... 🙂

I'm currently only concerned with maintaining the code itself (the 500+ lines of the MERGE statement). Do you just "eyeball" the changes and hope for the best, or is there a more structured way to maintain large MERGE statements?

-werners-
Esteemed Contributor III

I am still wondering what you mean by 500+ lines of code for a merge.

Do you mean the list of columns which should be updated?

If you want to update only a subset of columns, that can indeed become cumbersome. But with some coding in Scala/Python you can create a list of column names and pass it to the query.

If you want to update all columns, use *:

MERGE INTO table
  USING updates
  ON table.RegistrationYear IN ($yearlist) AND  -- optional extra filter, e.g. to prune partitions
     table.key1 = updates.key1 AND
     table.key2 = updates.key2 AND
     table.key3 = updates.key3
  WHEN MATCHED THEN
     UPDATE SET *
  WHEN NOT MATCHED THEN
     INSERT *

StevenW
New Contributor III

Yes, I mean the list of columns becomes large, and maintaining the MERGE statement then becomes very cumbersome.

Would you happen to have an example of such Python code? This does actually make sense... if such a list was generated dynamically and then used for the UPDATE/INSERT clauses...

-werners-
Esteemed Contributor III

I don't have an example at hand, but df.columns gives you all the columns of the table (in DataFrame form, of course). Depending on the case you can then drop columns, keep only a few, and so on, and use that list for the merge.

TBH I never did that though; I always use UPDATE SET * to avoid the hassle.
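
Roughly something like this (an untested sketch; the table names target_table/updates and the key list are placeholders):

# Generate the MERGE statement from the incoming DataFrame's schema
# instead of maintaining the column list by hand.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

updates_df = spark.table("updates")          # incoming data (assumed)
keys = ["key1", "key2", "key3"]              # merge keys (assumed)
cols = [c for c in updates_df.columns if c not in keys]

on_clause = " AND ".join(f"t.{k} = s.{k}" for k in keys)
set_clause = ", ".join(f"t.{c} = s.{c}" for c in cols)

merge_sql = f"""
MERGE INTO target_table t
USING updates s
ON {on_clause}
WHEN MATCHED THEN UPDATE SET {set_clause}
WHEN NOT MATCHED THEN INSERT *
"""
spark.sql(merge_sql)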

StevenW
New Contributor III

Would UPDATE SET * not require the source and target columns to have the same names, and the same column order?

-werners-
Esteemed Contributor III

I don't know about the order.

However, I always prepare the incoming data so that it has the same schema as the target. This makes merges easy. You really do not want to tinker around in a merge statement, typing tons of columns.

Using Scala/Python it is practically always possible to prepare your data.

It takes some time to learn, but it is worth it.
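
For example, a rough sketch of that preparation step (the table names are placeholders, and it assumes the incoming data already contains all of the target's columns under the same names):

# Project and cast the incoming data to the target table's schema so
# that UPDATE SET * / INSERT * work without listing columns.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

target_schema = spark.table("target_table").schema   # hypothetical target
incoming = spark.table("updates")                    # hypothetical source

aligned = incoming.select(
    [F.col(f.name).cast(f.dataType) for f in target_schema.fields]
)
# 'aligned' now has the same column names, order and types as the target.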

StevenW
New Contributor III

Thanks, this does make sense.

I have a new lead to chase... 🙂

Much appreciated. 😊
