Re: Delta merge file size control

pantelis_mare · ‎11-12-2021

Hello,

Took some more time investigating and trying @Sandeep Chandran idea.

I ran 4 different configurations. I have cached the update table and each time I was running a restore on the target table so the data we merge are identical.

Here are the files produced by each run on my BIGGEST partition which is the one blocking the stage:

run 1:

spark.databricks.delta.tuneFileSizesForRewrites: false

I suppose it uses file tuning on table size

run2:

spark.databricks.delta.tuneFileSizesForRewrites: false

spark.databricks.delta.optimize.maxFileSize: 268435456

run3:

spark.databricks.delta.tuneFileSizesForRewrites: false

delta.targetFileSize = 268435456 property on target table

run4:

spark.databricks.delta.tuneFileSizesForRewrites: true

As an extra info here is the records per partition,. As you see my dataframe is highly unbalanced.