02-03-2022 09:00 AM
We are reading files using Auto Loader in Databricks. The source system delivers a full snapshot of the complete data in each file, so we want to read the data and write it to a Delta table in overwrite mode, replacing all old data with the new data. For another use case, we also need to merge and update existing records in a Delta table.
Since Auto Loader only supports writing in append mode, is there any option to write in overwrite or merge mode?
Below is the code we are using to write in append mode with Auto Loader:
(df.writeStream
    .format("delta")
    .option("mergeSchema", "true")
    .outputMode("append")
    .option("checkpointLocation", checkpointLocation)
    .trigger(once=True)
    .start(deltaLakePath))
02-03-2022 09:25 AM
Hello there, @Ranjeet Jaiswal! My name is Piper, and I'm a moderator for the community. Welcome and thank you for your question. We will give your peers a chance to respond before we come back to this.
Thanks in advance for your patience. 🙂
02-04-2022 12:06 AM
@Ranjeet Jaiswal ,
afaik merge is supported:
https://docs.databricks.com/_static/notebooks/merge-in-streaming.html
That notebook also does some aggregation, but that part can of course be omitted.
The interesting pieces are outputMode("update") and the foreachBatch function, which does the actual merge.
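A minimal sketch of that pattern, in case it helps (assuming a hypothetical key column id; deltaLakePath and checkpointLocation are the variables from your snippet, and spark is the notebook's SparkSession):

from delta.tables import DeltaTable

def upsert_to_delta(micro_batch_df, batch_id):
    # merge the micro-batch into the target table, keyed on the assumed "id" column
    target = DeltaTable.forPath(spark, deltaLakePath)
    (target.alias("t")
        .merge(micro_batch_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(df.writeStream
    .option("checkpointLocation", checkpointLocation)
    .outputMode("update")
    .foreachBatch(upsert_to_delta)
    .trigger(once=True)
    .start())

With foreachBatch, the logic inside the function decides how each micro-batch lands in the table, so the stream itself is no longer limited to plain append writes.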
02-07-2022 08:16 AM
This is what we are using, but I was wondering whether it is the most efficient way.
In our case, every new file brings around 30-40 million records (the source provides a complete snapshot of the data). So even if only a few records have changed, the merge still has to process all 30-40 million records, which takes some time.
I'm not sure whether there is a more efficient way to handle this full-overwrite use case without the expensive merge.
11-15-2022 11:38 AM
I have the same question. The incoming/streaming data (using Auto Loader) could have anywhere from 1 to 1 million records. We need to find out which of these primary keys already exist in our target Delta table (which has billions of rows), ignore them, and only insert the new ones. This merge solution sounds a bit inefficient. Or perhaps I'm missing some special sauce?
11-16-2022 02:28 AM
If you want to go for an insert-only approach, there is a way, but it also involves some overhead: you can do a left_anti join of your new data against the Delta table.
Only records that do not yet exist remain, and those can be appended.
Of course, this also means that changes to existing records are not registered in the Delta table.
The reason merge works the way it does is that Parquet files are immutable: if you want to change a single record in a Parquet file, you have to rewrite the whole file.
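A minimal sketch of that insert-only approach, again assuming a hypothetical key column id and reusing deltaLakePath and checkpointLocation from the original post:

def insert_new_only(micro_batch_df, batch_id):
    existing = spark.read.format("delta").load(deltaLakePath)
    # left_anti keeps only rows whose key is not already present in the target
    new_rows = micro_batch_df.join(existing, on="id", how="left_anti")
    new_rows.write.format("delta").mode("append").save(deltaLakePath)

(df.writeStream
    .option("checkpointLocation", checkpointLocation)
    .foreachBatch(insert_new_only)
    .trigger(once=True)
    .start())

The trade-off is the anti-join against the large target table on every micro-batch, which is the overhead mentioned above.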