Databricks Community

TMNGB · ‎07-26-2022

In the case of processing multiple source files - with potentially, one or multiple entity versions per source - being able to use the MERGE statement whilst preserving the order is key to ensure the correct versioning of entity versions (aka, version 1 starts at X to Y, then comes version 2 from Y to Z, etc).

However, as far as I can tell, there is no guarantee that the data will be processed (MERGED) according to the order in the DataFrame. Has anyone confirmed this?

The current way to bypass this is to process each extraction date separately for the MERGE statement however, it is quite a slow process since Azure takes a long time to MERGE.

-werners- · ‎07-27-2022

you could guarantee order by actually ordering the dataframe which you want to merge or use a window function (and keeping only the most recent record f.e.).

The upsert is an expensive operation, so depending on the amount of data which has to be evaluated it can take a while indeed.

There are some tweaks possible though:

https://docs.microsoft.com/en-us/azure/databricks/kb/delta/delta-merge-into

View solution in original post

-werners- · ‎07-27-2022

you could guarantee order by actually ordering the dataframe which you want to merge or use a window function (and keeping only the most recent record f.e.).

The upsert is an expensive operation, so depending on the amount of data which has to be evaluated it can take a while indeed.

There are some tweaks possible though:

https://docs.microsoft.com/en-us/azure/databricks/kb/delta/delta-merge-into