Delta merge file size control

pantelis_mare
Contributor III

Hello community!

I have a rather weird issue where a delta merge is writing very big files (~1GB) that slow down my pipeline. Here is some context:

I have a dataframe containing updates for several dates in the past. The current and previous day contain the vast majority of the rows (>95%), and the rest are spread across older days (around 100 other unique dates). My target table is partitioned by date.

The issue is that when the merge writes its output, the largest date partition ends up with only 2-3 files of around 1GB each. My whole pipeline is then blocked on the write of these files, which takes much longer than the other ones.

I have played with all the obvious configurations, such as:

delta.tuneFileSizesForRewrites

delta.targetFileSize

delta.merge.enableLowShuffle

everything seems to be ignored and the files remain at this scale.

Note: running on DBR 10.0, with delta.optimizedWrites.enabled set to true.
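
For reference, here is roughly how I am applying these settings (a minimal sketch; the table name and exact values are placeholders):

# Table-level properties on the merge target (table name is a placeholder)
spark.sql("""
    ALTER TABLE my_db.my_target SET TBLPROPERTIES (
        'delta.tuneFileSizesForRewrites' = 'false',
        'delta.targetFileSize' = '268435456'
    )
""")

# Session-level setting for low shuffle merge
spark.conf.set("spark.databricks.delta.merge.enableLowShuffle", "true")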

Is there anything that I am missing?

Thank you in advance!

6 REPLIES

-werners-
Esteemed Contributor III

Maybe the table size is 10+ TB?

If you use autotune, Delta Lake picks a target file size based on the table size:

https://docs.microsoft.com/en-us/azure/databricks/delta/optimizations/file-mgmt#autotune-based-on-ta...

However, targetFileSize should disable the autotune... weird.

I use the following settings (which create files around 256MB):

spark.sql("set spark.databricks.delta.autoCompact.enabled = true")

spark.sql("set spark.databricks.delta.optimizeWrite.enabled = true")

spark.sql("set spark.databricks.delta.merge.enableLowShuffle = true")

Hubert-Dudek
Esteemed Contributor III

Delta is a transactional format (it keeps incremental changes in JSON files and snapshots in Parquet); when I just want raw performance, I usually prefer plain Parquet.

Sandeep
Contributor III

@Pantelis Maroudis, can you try setting spark.databricks.delta.optimize.maxFileSize?
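
For example, in the same style as the settings above (256 MB is just an illustrative value):

spark.sql("set spark.databricks.delta.optimize.maxFileSize = 268435456")  # 256 MB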

pantelis_mare
Contributor III

Hello,

I took some more time investigating and tried @Sandeep Chandran's idea.

I ran 4 different configurations. I cached the update dataframe, and before each run I restored the target table so that the data being merged was identical.
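
The restore between runs looked roughly like this (table name and version number are placeholders):

# Reset the target table to its pre-merge state before each run
spark.sql("RESTORE TABLE my_db.my_target TO VERSION AS OF 123")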

Here are the files produced by each run on my BIGGEST partition, which is the one blocking the stage:

run 1:

spark.databricks.delta.tuneFileSizesForRewrites: false

I suppose it uses file tuning on table size

run 2:

spark.databricks.delta.tuneFileSizesForRewrites: false

spark.databricks.delta.optimize.maxFileSize: 268435456

run 3:

spark.databricks.delta.tuneFileSizesForRewrites: false

delta.targetFileSize = 268435456 property on target table

run 4:

spark.databricks.delta.tuneFileSizesForRewrites: true

As extra info, here are the record counts per partition. As you can see, my dataframe is highly unbalanced.

[table of record counts per partition not reproduced]

Hi @Pantelis Maroudis,

Are you still looking for help to solve this issue?

pantelis_mare
Contributor III

Hello Jose,

I just went with splitting the merge in two: one merge touches many partitions but only a few rows each, and a second touches the 2-3 partitions that contain the bulk of the data.
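
Roughly, the split looks like this (date column, cutoff date, table name and join keys are placeholders):

from delta.tables import DeltaTable

updates_df = spark.table("my_db.updates")              # placeholder source of updates
target = DeltaTable.forName(spark, "my_db.my_target")  # placeholder target table

# First merge: the many "cold" date partitions with only a few rows each
older = updates_df.filter("date < '2021-12-01'")
# Second merge: the 2-3 "hot" partitions that hold the bulk of the rows
recent = updates_df.filter("date >= '2021-12-01'")

for batch in (older, recent):
    (target.alias("t")
        .merge(batch.alias("u"), "t.date = u.date AND t.id = u.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())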
