Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Merge rewrites many unmodified files

MuesLee
New Contributor

Hello. I want to merge a subset of my Delta table's partitions to perform incremental upserts and keep two tables in sync. I do not use a whenNotMatchedBySource clause to clean up stale rows in my target because of this GitHub Issue.

Because of that, my first step focuses only on updates and inserts, and in this step I am facing an issue that I need help with. As a prerequisite for the merge, I have extracted the most recent and relevant partitions (the last X partitions) of the dataframe that I want to sync my Delta table with, and created a Spark dataframe called "updates" that contains only the relevant partitions (to "prune" the source). (Side note: the Delta table and the Spark dataframe have equivalent schemas.)

My merge statement looks like this: 

 

deltaTable_target.alias("target").merge(
    df_subset_relevant.alias("source"),
    f"target.PARTITION IN ({partition_list}) AND target.hashkey = source.HASHKEY"
).whenNotMatchedInsertAll(
).whenMatchedUpdateAll(
    condition="target.col_that_can_change <> source.col_that_can_change"
).execute()

 

The merge statement itself works correctly according to the Delta table's log files (everything is inserted and updated correctly). Yet the merge marks files as touched (and rewrites them) even when they contain no changes at all; specifically, I know that one of the partitions is identical between source and target, and it still gets rewritten completely.

My assumption is that this is forced by the merge condition used to partition-prune the target table. At least I cannot imagine any other reason for it.

In an ideal world, numTargetRowsCopied in the logs should equal the size of the partitions that actually changed, not the size of all partitions that are part of the merge; am I correct?
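For reference, the merge's operationMetrics can be read from the table history (e.g. via DESCRIBE HISTORY or DeltaTable.history(1)). A minimal sketch of interpreting them, with made-up values — the metric keys are real Delta MERGE metrics, but the numbers and the copied-to-changed comparison are illustrative assumptions:

```python
# Hypothetical operationMetrics map as it would appear in the output of
# DESCRIBE HISTORY / DeltaTable.history(1) after a MERGE.
# The keys are real Delta merge metrics; the values are made up.
metrics = {
    "numTargetRowsInserted": "120",
    "numTargetRowsUpdated": "35",
    "numTargetRowsCopied": "480000",  # unchanged rows rewritten into new files
    "numTargetFilesAdded": "14",
    "numTargetFilesRemoved": "14",
}

changed = int(metrics["numTargetRowsInserted"]) + int(metrics["numTargetRowsUpdated"])
copied = int(metrics["numTargetRowsCopied"])

# A large copied-to-changed ratio is the symptom described above:
# many untouched rows are rewritten alongside the few real changes.
ratio = copied / max(changed, 1)
```

If the ratio is large, most of the merge's write volume is unchanged rows being carried over into new files rather than real updates.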

Maybe I do not understand the inner workings correctly and this is a necessary step. Any insight into this problem is highly appreciated.

1 REPLY

Brahmareddy
Honored Contributor III

Hi MuesLee,

How are you doing today? As I understand it, yes, your understanding is mostly correct. The reason even unchanged partitions are being rewritten lies in how Delta Lake's merge operation handles partition pruning and updates. Even though your merge condition filters partitions using target.PARTITION IN ({partition_list}), Delta Lake still rewrites entire matching files if any row within them needs to be updated. This is because Delta operates at the file level, not the row level: if even one row in a file is updated, the whole file gets rewritten.
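One commonly suggested mitigation is to put the partition filter on both sides of the merge condition, so the optimizer can prune files on the target and skip out-of-window rows on the source. A minimal sketch of building such a condition — the column names come from the thread, while the partition values here are invented:

```python
# Invented partition values for illustration.
relevant_partitions = ["2024-01", "2024-02", "2024-03"]
partition_list = ", ".join(f"'{p}'" for p in relevant_partitions)

# Filter the partition column on BOTH target and source, plus the key match.
merge_condition = (
    f"target.PARTITION IN ({partition_list}) "
    f"AND source.PARTITION IN ({partition_list}) "
    f"AND target.hashkey = source.HASHKEY"
)

# The condition would then be passed to the merge builder, e.g.:
# deltaTable_target.alias("target").merge(
#     df_subset_relevant.alias("source"), merge_condition
# ).whenNotMatchedInsertAll().whenMatchedUpdateAll(...).execute()
```

The key point is that the pruning predicate appears for both aliases, so neither side forces a scan of partitions outside the window.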

To minimize this, you could try Z-Ordering on hashkey to improve the file layout and reduce the number of files rewritten. Another approach is to use Optimized Writes and Change Data Feed (CDF), if available, to track and update only changed rows more efficiently. While Delta's merge isn't as fine-grained as that of traditional row-based databases, understanding that it rewrites whole files rather than individual rows can help you design strategies that minimize unnecessary rewrites.
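The two suggestions above can be sketched as notebook commands. Here `target_table` is a placeholder name (only `hashkey` comes from the thread); the statements are standard Delta SQL, but treat this as a sketch to adapt, not a drop-in fix:

```python
# Placeholder table name; ZORDER column taken from the thread.
zorder_sql = "OPTIMIZE target_table ZORDER BY (hashkey)"

# Enable Change Data Feed so downstream syncs can read only changed rows.
enable_cdf_sql = (
    "ALTER TABLE target_table "
    "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
)

# In a Databricks notebook these would be run with spark.sql(...):
# spark.sql(zorder_sql)
# spark.sql(enable_cdf_sql)
```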

Regards,

Brahma
