Merge operation replaces most of the underlying parquets

dasiekr
New Contributor II

Hello,

I have the following situation which I would like to fully understand.

I have a Delta table that consists of ~10k active parquet files. Every day I run a MERGE operation based on new deliveries, joining on the product_id key attribute. I checked the metadata and it turns out that the rows coming from the latest delivery (marked with a delivery_id column) are usually stored in around 150 parquet files.

The thing is that when I look at the operation metrics of a merge, I see that most of the files (around 8k) get removed in exchange for newly created files (again around 8k). After each merge I run OPTIMIZE with Z-ORDER BY product_id, but its operation metrics show that it removes ~200 files.

The question is: why does each merge operation need to replace almost all of the parquet files that made up the previous version of the Delta table? I could understand it if the new delivery brought so many updates that it had to affect that many parquets, but as I mentioned, it seems it should only affect data in ~200 files. This generates redundant data despite running VACUUM. Has anyone had a similar issue?
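For context, the numbers above come from the table history; metrics like these can be read roughly as follows (the table name my_delta_table is a placeholder, not my actual table, and the exact metric keys may vary by operation):

```python
from pyspark.sql import functions as F

# Placeholder table name; the real table is different.
history = spark.sql("DESCRIBE HISTORY my_delta_table")

(history
 .where(F.col("operation").isin("MERGE", "OPTIMIZE"))
 .select(
     "version",
     "operation",
     # MERGE reports numTargetFilesAdded / numTargetFilesRemoved;
     # OPTIMIZE reports numAddedFiles / numRemovedFiles.
     F.col("operationMetrics").getItem("numTargetFilesAdded").alias("merge_files_added"),
     F.col("operationMetrics").getItem("numTargetFilesRemoved").alias("merge_files_removed"),
     F.col("operationMetrics").getItem("numAddedFiles").alias("optimize_files_added"),
     F.col("operationMetrics").getItem("numRemovedFiles").alias("optimize_files_removed"),
 )
 .orderBy(F.col("version").desc())
 .show(truncate=False))
```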

3 REPLIES

Ajay-Pandey
Esteemed Contributor III

Hi @dasiekr, please refer to the content below; it might help you:

MERGE: Under the hood

Delta Lake completes a MERGE in two steps.

  1. Perform an inner join between the target table and source table to select all files that have matches.
  2. Perform an outer join between the selected files in the target and source tables and write out the updated/deleted/inserted data.

The main way that this differs from an UPDATE or a DELETE under the hood is that Delta Lake uses joins to complete a MERGE. This fact allows us to utilize some unique strategies when seeking to improve performance.
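As a minimal illustration (my_delta_table, new_delivery and the product_id join key are placeholders, not your objects), a MERGE in PySpark with the Delta Lake API looks like this, and both steps described above happen inside the single execute() call:

```python
from delta.tables import DeltaTable

# Placeholder target table and source DataFrame.
target = DeltaTable.forName(spark, "my_delta_table")
source = spark.table("new_delivery")

# Step 1 (inner join to find the files with matches) and step 2 (outer join to
# rewrite those files) are both carried out by this one MERGE.
(target.alias("t")
 .merge(source.alias("s"), "t.product_id = s.product_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```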

MERGE: Performance tuning tips

To improve performance of the MERGE command, you need to determine which of the two joins that make up the merge is limiting your speed.

If the inner join is the bottleneck (i.e., finding the files that Delta Lake needs to rewrite takes too long), try the following strategies (a short sketch follows this list):

    • Add more predicates to narrow down the search space.
    • Adjust shuffle partitions.
    • Adjust broadcast join thresholds.
    • Compact the small files in the table if there are lots of them, but don't compact them into files that are too large, since Delta Lake has to copy the entire file to rewrite it.
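A hedged sketch of the inner-join-side tips, assuming there is a date-like column (event_date here) that the merge condition can be restricted on; all names, configuration values and the cutoff date are examples only, not something specific to your table:

```python
from delta.tables import DeltaTable

# Example values only; tune per workload.
spark.conf.set("spark.sql.shuffle.partitions", "400")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024))

target = DeltaTable.forName(spark, "my_delta_table")
source = spark.table("new_delivery")

# An extra predicate on the merge condition narrows the search space, so Delta
# can skip files/partitions that cannot contain matches.
(target.alias("t")
 .merge(source.alias("s"),
        "t.product_id = s.product_id AND t.event_date >= '2024-01-01'")  # hypothetical column and cutoff
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```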

On the other hand, if the outer join is the bottleneck (i.e., rewriting the actual files themselves takes too long), try the strategies below (again, a sketch follows the list):

  • Adjust shuffle partitions.
    • Can generate too many small files for partitioned tables.
    • Reduce files by enabling automatic repartitioning before writes (with Optimized Writes in Databricks Delta Lake).
  • Adjust broadcast thresholds. If you're doing a full outer join, Spark cannot do a broadcast join, but if you're doing a right outer join, Spark can do one, and you can adjust the broadcast thresholds as needed.
  • Cache the source table / DataFrame.
    • Caching the source table can speed up the second scan, but be sure not to cache the target table, as this can lead to cache coherency issues.
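And a hedged sketch of the outer-join-side tips; the config keys are, to my knowledge, standard Spark / Databricks Delta settings, and the source table name is again a placeholder:

```python
# Example value only; tune per workload.
spark.conf.set("spark.sql.shuffle.partitions", "400")

# Optimized Writes: repartition before writing to avoid many small output files.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

# Cache the source (never the target) so the second scan of the MERGE is cheap.
source = spark.table("new_delivery").cache()
source.count()  # materialize the cache before running the MERGE
```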

dasiekr
New Contributor II

Hi @Ajay-Pandey,

Thanks for your response.

However, the problem here is not the performance of the merge in terms of speed. This operation nearly sets the Delta table up from scratch, as it adds and removes most of the currently active parquets, while only ~200 parquets are affected by the last delivery's data. This is what I can't understand: how is that happening?

Ajay-Pandey
Esteemed Contributor III

In a normal scenario it should only update the files that hold the respective data, so this looks weird.
