07-26-2022 07:01 AM
When processing multiple source files, each of which may contain one or more versions of an entity, being able to use the MERGE statement while preserving order is key to versioning the entities correctly (i.e., version 1 runs from X to Y, then version 2 runs from Y to Z, and so on).
However, as far as I can tell, there is no guarantee that the data will be processed (MERGED) according to the order in the DataFrame. Has anyone confirmed this?
My current workaround is to run the MERGE statement separately for each extraction date; however, this is quite slow, since a MERGE takes a long time on Azure.
07-27-2022 04:06 AM
You could guarantee order by actually ordering the DataFrame you want to merge, or by using a window function (keeping only the most recent record, for example).
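To illustrate the window-function idea, here is a minimal sketch in plain Python rather than Spark, so it runs without a cluster. The row layout `(entity_id, extraction_date, payload)` and the function names are assumptions made up for this example, not anything from the original question. The reason to precompute like this: as far as I know, a Delta MERGE fails when several source rows match the same target row, so the source has to be reduced to one row per key (or one row per key and version) before merging.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical source rows: (entity_id, extraction_date, payload).
# ISO date strings sort correctly as plain text.
rows = [
    ("A", "2022-07-03", "v3"),
    ("A", "2022-07-01", "v1"),
    ("B", "2022-07-02", "v1"),
    ("A", "2022-07-02", "v2"),
]


def latest_only(rows):
    """Keep only the most recent row per entity (the 'keep latest' variant)."""
    latest = {}
    for row in rows:
        entity_id, extraction_date = row[0], row[1]
        if entity_id not in latest or extraction_date > latest[entity_id][1]:
            latest[entity_id] = row
    return sorted(latest.values())


def with_validity(rows):
    """Chain all versions per entity: each version ends where the next begins.

    Returns (entity_id, valid_from, valid_to, payload); valid_to is None
    for the currently open version.
    """
    out = []
    ordered = sorted(rows, key=itemgetter(0, 1))
    for entity_id, group in groupby(ordered, key=itemgetter(0)):
        versions = list(group)
        for i, (_, valid_from, payload) in enumerate(versions):
            has_next = i + 1 < len(versions)
            valid_to = versions[i + 1][1] if has_next else None
            out.append((entity_id, valid_from, valid_to, payload))
    return out


if __name__ == "__main__":
    for row in with_validity(rows):
        print(row)
```

In PySpark the same logic would typically be written with a window partitioned by the entity key and ordered by extraction date, using `row_number()` for the keep-latest variant or `lead()` to derive the end of each version. With the intervals computed up front, all extraction dates can go into a single MERGE keyed on entity and start date, and the result no longer depends on the order in which rows are processed.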
The upsert is an expensive operation, so depending on the amount of data that has to be evaluated, it can indeed take a while.
There are some tweaks possible though:
https://docs.microsoft.com/en-us/azure/databricks/kb/delta/delta-merge-into
08-29-2022 10:09 AM
Hi @Guilherme Banhudo, I hope werners' answer helped you. Please let me know if you still have any doubts or questions.