Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Delta Lake File Sizes - optimize.maxFileSize, tuneFileSizesForRewrites

noorbasha534
Valued Contributor II

Hello all,

While reading content that provides guidance on Delta Lake file sizes, I realized that tuneFileSizesForRewrites behind the scenes targets a 256 MB file size.

And optimize.maxFileSize targets a 1 GB file size (reference: https://docs.databricks.com/aws/en/sql/language-manual/delta-optimize).
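For reference, a minimal sketch of the two knobs in question (PySpark in a Databricks notebook, where spark is predefined; the table name silver.events is just a placeholder):

# Let Databricks auto-tune file sizes for tables that are frequently rewritten
# by MERGE (targets roughly 256 MB files).
spark.sql("""
    ALTER TABLE silver.events
    SET TBLPROPERTIES ('delta.tuneFileSizesForRewrites' = 'true')
""")

# OPTIMIZE's output file cap is a separate session setting, in bytes
# (default 1073741824, i.e. 1 GB).
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 1073741824)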

In our environment, we tried tuneFileSizesForRewrites and saw files of around 86 MB for a table, but OPTIMIZE did not make them any bigger - not even 126 MB, let alone 1 GB.
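(A rough way to check the average file size, in case it helps the discussion; a PySpark sketch, with silver.events again as a placeholder table name:)

# DESCRIBE DETAIL returns one row with, among others, numFiles and sizeInBytes
# for the current snapshot of the table.
detail = spark.sql("DESCRIBE DETAIL silver.events").collect()[0]
avg_mb = detail["sizeInBytes"] / detail["numFiles"] / (1024 * 1024)
print(f"files: {detail['numFiles']}, avg size: ~{avg_mb:.0f} MB")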

Also, I read in the 'Delta Lake: The Definitive Guide' book that prior to DBR 8.4, tuneFileSizesForRewrites was based on the number of MERGE operations among the last 10 operations on the table; in essence, the idea was to create smaller files to improve MERGE performance. We saw this in some form while testing to decide between tuneFileSizesForRewrites and setting targetFileSize manually: when we specified targetFileSize = 256 MB instead of our current targetFileSize = 33 MB, MERGE performance (bronze > silver layer) was slower. Now, a few questions -

  • In our environment, OPTIMIZE did not create 1 GB files; is that the case for you all as well?
  • If OPTIMIZE really does create bigger files of around 1 GB, won't MERGE performance be a problem then?
  • In the raw (plain Parquet / source mimic) > bronze layer load, which is append-only, can we specify a very large targetFileSize? In the same book, the author makes the claim: 'High-volume append-only tables in the bronze layer generally function better with larger file sizes, as the larger sizes maximize throughput per operation with little regard to anything else.'
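(For the last question, the kind of per-table setting we have in mind is the sketch below; bronze.events is a placeholder table name and 512 MB is only an illustrative value, not a recommendation:)

# Pin a larger per-table target for the append-only bronze table.
# delta.targetFileSize is given here in bytes (536870912 = 512 MB).
spark.sql("""
    ALTER TABLE bronze.events
    SET TBLPROPERTIES ('delta.targetFileSize' = '536870912')
""")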

 

1 REPLY

radothede
Valued Contributor II

Hello @noorbasha534,

That's a very interesting topic regarding fine-tuning file sizes for Delta tables.

Answering your questions:

1)

I use spark.databricks.delta.optimize.maxFileSize to set the maximum file size for the OPTIMIZE command. It works just fine for me in most cases: OPTIMIZE creates a new version of the Delta table, in my scenario with file sizes close to the maximum limit. Please remember it is a maximum limit, not the desired size of the Parquet files. The final Parquet file size depends on other factors, such as data distribution across partitions, clustering keys, and whether there is sufficient data in your table.
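A minimal sketch of how I use it (PySpark; the table name and the 512 MB cap are just examples):

# Cap the files written by OPTIMIZE at ~512 MB instead of the 1 GB default
# (the value is in bytes).
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 512 * 1024 * 1024)
spark.sql("OPTIMIZE silver.events")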

 

2)

Larger files (approaching ~1 GB) can impact MERGE performance because:

a) MERGE operations need to rewrite entire files when any record in that file is affected

b) Larger files mean more data movement even for small changes

c) The overhead increases with file size, especially for selective updates

I would recommend using tuneFileSizesForRewrites for your silver layer.
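For context, this is the kind of bronze > silver MERGE the above applies to (a sketch; table and column names are placeholders):

# Any data file in silver.events that contains at least one matched id
# is rewritten in full, which is why the target file size matters here.
spark.sql("""
    MERGE INTO silver.events AS t
    USING bronze.events_updates AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")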

 

3)

I would say YES, go for a larger maxFileSize limit, but I believe it should not be larger than 1 GB.

 

As an additional resource, I would recommend taking a look at this: delta tune file size

 

Best,

Radek.
