Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Delta Lake File Sizes - optimize.maxFileSize, tuneFileSizesForRewrites

noorbasha534
Valued Contributor II

Hello all,

While reading content that provides guidance on Delta Lake file sizes, I realized that tuneFileSizesForRewrites behind the scenes targets a 256 MB file size.

And optimize.maxFileSize targets a 1 GB file size (reference: https://docs.databricks.com/aws/en/sql/language-manual/delta-optimize).
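For reference, a minimal sketch of the two knobs in question (PySpark in a Databricks notebook, where spark is predefined; the table name silver.events is just a placeholder):

# Let Databricks auto-tune file sizes for tables that are frequently rewritten
# by MERGE (targets roughly 256 MB files).
spark.sql("""
    ALTER TABLE silver.events
    SET TBLPROPERTIES ('delta.tuneFileSizesForRewrites' = 'true')
""")

# OPTIMIZE's output file cap is a separate session setting, in bytes
# (default 1073741824, i.e. 1 GB).
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 1073741824)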

In our environment, we tried tuneFileSizesForRewrites and saw files of around 86 MB for a table, but OPTIMIZE did not make them any bigger - not even 126 MB, let alone 1 GB.
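(A rough way to check the average file size, in case it helps the discussion; a PySpark sketch, with silver.events again as a placeholder table name:)

# DESCRIBE DETAIL returns one row with, among others, numFiles and sizeInBytes
# for the current snapshot of the table.
detail = spark.sql("DESCRIBE DETAIL silver.events").collect()[0]
avg_mb = detail["sizeInBytes"] / detail["numFiles"] / (1024 * 1024)
print(f"files: {detail['numFiles']}, avg size: ~{avg_mb:.0f} MB")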

Also, I read in the 'Delta Lake: The Definitive Guide' book that prior to DBR 8.4, tuneFileSizesForRewrites was based on the number of MERGE operations among the last 10 operations on the table; in essence, the idea was to create smaller files to improve MERGE performance. We saw this in some form while testing to decide between tuneFileSizesForRewrites and setting targetFileSize manually: when we specified targetFileSize = 256 MB instead of our current targetFileSize = 33 MB, MERGE performance (bronze > silver layer) was slower. Now, a few questions -

  • In our environment, OPTIMIZE did not create 1 GB files; is that the case for you all as well?
  • If OPTIMIZE really does create bigger files of around 1 GB, won't MERGE performance be a problem then?
  • In the raw (plain Parquet / source mimic) > bronze layer load, which is append-only, can we specify a very large targetFileSize? In the same book, the author makes the claim: 'High-volume append-only tables in the bronze layer generally function better with larger file sizes, as the larger sizes maximize throughput per operation with little regard to anything else.'
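(For the last question, the kind of per-table setting we have in mind is the sketch below; bronze.events is a placeholder table name and 512 MB is only an illustrative value, not a recommendation:)

# Pin a larger per-table target for the append-only bronze table.
# delta.targetFileSize is given here in bytes (536870912 = 512 MB).
spark.sql("""
    ALTER TABLE bronze.events
    SET TBLPROPERTIES ('delta.targetFileSize' = '536870912')
""")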

 

1 REPLY

radothede
Valued Contributor II

Hello @noorbasha534,

That's a very interesting topic regarding fine-tuning file sizes for Delta tables.

Answering your questions:

1)

I use spark.databricks.delta.optimize.maxFileSize to set the maximum file size for the OPTIMIZE command. It works just fine for me in most cases: OPTIMIZE creates a new version of the Delta table, in my scenario with file sizes close to the maximum limit. Please remember it is a maximum limit, not the desired size of the Parquet files. The final Parquet file size depends on other factors, such as data distribution across partitions, clustering keys, and whether there is sufficient data in your table.
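A minimal sketch of how I use it (PySpark; the table name and the 512 MB cap are just examples):

# Cap the files written by OPTIMIZE at ~512 MB instead of the 1 GB default
# (the value is in bytes).
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 512 * 1024 * 1024)
spark.sql("OPTIMIZE silver.events")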

 

2)

Larger files (approaching ~1 GB) can impact MERGE performance because:

a) MERGE operations need to rewrite entire files when any record in that file is affected

b) Larger files mean more data movement even for small changes

c) The overhead increases with file size, especially for selective updates

I would recommend using tuneFileSizesForRewrites for your silver layer.
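For context, this is the kind of bronze > silver MERGE the above applies to (a sketch; table and column names are placeholders):

# Any data file in silver.events that contains at least one matched id
# is rewritten in full, which is why the target file size matters here.
spark.sql("""
    MERGE INTO silver.events AS t
    USING bronze.events_updates AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")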

 

3)

I would say YES, go for a larger maxFileSize limit, but I believe it should not be larger than 1 GB.

 

As an additional resource, I would recommend taking a look at this: delta tune file size

 

Best,

Radek.
