Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Merge operation replaces most of the underlying parquets

dasiekr
New Contributor II

Hello,

I have the following situation which I would like to fully understand.

I have a Delta table that consists of about 10k active parquet files. Every day I run a MERGE operation on the new delivery, joining on the product_id key attribute. I checked the metadata, and it turns out that the rows coming from the latest delivery (marked with a delivery_id column) are usually stored in around 150 parquet files. The thing is, when I look at the operation metrics of a merge, I see that most of the files (around 8k) get removed in exchange for roughly the same number of newly created files. After each merge I run OPTIMIZE with Z-ORDER BY product_id, but its operation metrics show that it removes only ~200 files.

The question is: why does each merge operation need to replace almost all of the parquet files that made up the previous version of the Delta table? I would understand it if the new delivery brought so many updates that it had to touch that many parquets, but as I mentioned, it seems it should only affect data in ~200 files. This generates redundant data despite running VACUUM. Has anyone had a similar issue?
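For reference, the daily workflow described above would look roughly like the sketch below. The names sales_target and new_delivery_df are placeholder stand-ins for the actual table and delivery DataFrame; only product_id comes from the post.

    from delta.tables import DeltaTable

    # Hypothetical names: "sales_target" is the 10k-file Delta table,
    # "new_delivery_df" is the DataFrame holding the latest delivery.
    target = DeltaTable.forName(spark, "sales_target")

    (target.alias("t")
        .merge(new_delivery_df.alias("s"), "t.product_id = s.product_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # Post-merge maintenance as described in the question.
    spark.sql("OPTIMIZE sales_target ZORDER BY (product_id)")
    spark.sql("VACUUM sales_target")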

3 REPLIES

Ajay-Pandey
Esteemed Contributor III

Hi @dasiekr, please refer to the content below, which might help -

MERGE: Under the hood

Delta Lake completes a MERGE in two steps.

  1. Perform an inner join between the target table and source table to select all files that have matches.
  2. Perform an outer join between the selected files in the target and source tables and write out the updated/deleted/inserted data.

The main way that this differs from an UPDATE or a DELETE under the hood is that Delta Lake uses joins to complete a MERGE. This fact allows us to utilize some unique strategies when seeking to improve performance.
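These two steps are what the merge operation metrics in the table history reflect: roughly speaking, numTargetFilesRemoved counts the files the inner join selected for rewriting, and numTargetFilesAdded counts the files the outer join wrote back. A minimal way to check this, reusing the hypothetical sales_target name from above:

    # Inspect MERGE operation metrics from the Delta transaction history.
    history = spark.sql("DESCRIBE HISTORY sales_target")

    (history
        .filter("operation = 'MERGE'")
        .selectExpr(
            "version",
            "operationMetrics['numTargetFilesRemoved'] AS files_removed",
            "operationMetrics['numTargetFilesAdded'] AS files_added")
        .show(truncate=False))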

MERGE: Performance tuning tips

To improve performance of the MERGE command, you need to determine which of the two joins that make up the merge is limiting your speed.

If the inner join is the bottleneck (i.e., finding the files that Delta Lake needs to rewrite takes too long), try the following strategies (a sketch follows the list):

    • Add more predicates to narrow down the search space.
    • Adjust shuffle partitions.
    • Adjust broadcast join thresholds.
    • Compact the small files in the table if there are lots of them, but don't compact them into files that are too large, since Delta Lake has to copy the entire file to rewrite it.
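These tips map to a few concrete knobs. A hedged sketch, reusing the placeholder names from earlier; the extra delivery_date predicate and all threshold values are illustrative assumptions, not recommendations:

    # Shuffle partitions and broadcast threshold influence the file-matching
    # (inner join) phase; the values here are placeholders to tune.
    spark.conf.set("spark.sql.shuffle.partitions", "400")
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))

    # An extra predicate on the merge condition narrows the target files that
    # need to be scanned; 't.delivery_date' is a hypothetical column.
    merge_condition = (
        "t.product_id = s.product_id "
        "AND t.delivery_date >= current_date() - INTERVAL 30 DAYS"
    )

    # Compacting small files reduces the number of files the inner join touches.
    spark.sql("OPTIMIZE sales_target")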

On the other hand, if the outer join is the bottleneck (i.e., rewriting the actual files themselves takes too long), try the strategies below (again, a sketch follows the list):

  • Adjust shuffle partitions.
    • Can generate too many small files for partitioned tables.
    • Reduce files by enabling automatic repartitioning before writes (with Optimized Writes in Databricks Delta Lake).
  • Adjust broadcast thresholds. If you're doing a full outer join, Spark cannot do a broadcast join, but if you're doing a right outer join, Spark can do one, and you can adjust the broadcast thresholds as needed.
  • Cache the source table / DataFrame.
    • Caching the source table can speed up the second scan, but be sure not to cache the target table, as this can lead to cache coherency issues.
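A corresponding sketch for the rewrite (outer join) side. The Optimized Writes settings are Databricks-specific Delta properties; names and values are again placeholders to adapt:

    # Enable Optimized Writes so Delta repartitions data before writing,
    # which reduces the number of small files produced by the rewrite.
    spark.sql("""
        ALTER TABLE sales_target
        SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
    """)
    spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

    # Shuffle partitions and broadcast threshold also govern the write-side join.
    spark.conf.set("spark.sql.shuffle.partitions", "200")
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(100 * 1024 * 1024))

    # Cache the source DataFrame (not the target table) so the second scan
    # of the merge is served from memory.
    new_delivery_df.cache()
    new_delivery_df.count()  # materialize the cache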
Ajay Kumar Pandey

dasiekr
New Contributor II

Hi @Ajay-Pandey,

Thanks for your response.

However, the problem here is not the performance of the merge in terms of speed. The operation nearly rebuilds the Delta table from scratch, as it adds and removes most of the currently active parquet files, while only ~200 parquets are actually affected by the latest delivery data. That is what I can't understand - how is that happening?

Ajay-Pandey
Esteemed Contributor III

In a normal scenario the merge should only rewrite the files that contain the matching data, so this does look strange.

Ajay Kumar Pandey
