Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Auto tuning of file size

pooja_bhumandla
New Contributor II

Why are maxFileSize and minFileSize different from targetFileSize after optimization? What is the significance of targetFileSize?

{
  "numRemovedFiles": "2099",
  "numRemovedBytes": "29658974681",
  "p25FileSize": "29701688",
  "numDeletionVectorsRemoved": "0",
  "minFileSize": "19920357",
  "numAddedFiles": "883",
  "maxFileSize": "43475356",
  "p75FileSize": "34394580",
  "p50FileSize": "31978037",
  "numAddedBytes": "28254074450"
}

The table's targetFileSize is "33554432". Please explain using the above information.
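A quick sanity check with plain Python arithmetic on the metrics above (just back-of-the-envelope math, not a Databricks API) shows that the average added file size lands close to the target even though individual files spread around it:

```python
# OPTIMIZE metrics copied from the post above.
num_added_files = 883
num_added_bytes = 28_254_074_450
target_file_size = 33_554_432  # 32 MiB

avg_added = num_added_bytes / num_added_files
print(f"average added file size: {avg_added:,.0f} bytes")
print(f"avg / target ratio:      {avg_added / target_file_size:.2f}")  # about 0.95
```

So the optimizer hit the target on average; minFileSize and maxFileSize just describe the spread around it.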

3 REPLIES

loui_wentzel
New Contributor III

Depending on your table, when running OPTIMIZE or similar it can be more compute-efficient to allow a few size outliers that are bigger or smaller than the target, rather than hard-setting every file to one exact value. The metrics are simply telling you that the smallest and biggest files are 19920357 and 43475356 bytes respectively, while the target was 33554432.

 

pooja_bhumandla
New Contributor II

Why is allowing file sizes bigger or smaller than the target a better solution? What are the reasons for it?

loui_wentzel
New Contributor III

There could be several different reasons, but mainly it's because grouping arbitrary data into some target file size is, well... arbitrary.

Imagine I gave you a large container of sand and some empty buckets, and asked you to move the sand from the container to the buckets, aiming for half-full buckets. As you fill your buckets you realise you have 3 half-full buckets and only a small portion of sand left. Do you redistribute everything to get to maybe 2/5 in four buckets? Or 3/5 in three? Or do you leave the fourth bucket with only a cup in it? Or keep 2 half-full and a third a bit over target?

Now this is a very simplistic example, but imagine now types of sand, colours, grain sizes, etc., and I asked you to redistribute the sand in a very specific way depending on its properties. This is no longer a simple task, even for three buckets.

This is basically what file-size optimization (if you include partitioning, OPTIMIZE, etc.) does. It redistributes everything neatly into buckets, but it is impossible to get the same size in each bucket, because the contents don't divide neatly the way you ask them to. That's why it's a target: it's what the machine aims for, but there will always be outliers.
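The bucket analogy can be sketched as a toy greedy packer in plain Python (made-up illustration code, not Delta's actual bin-packing algorithm; the chunk sizes are random, hypothetical values):

```python
import random

def pack_chunks(chunk_sizes, target):
    """Greedily pack chunks into files, closing a file once it reaches
    the target size. Real OPTIMIZE also has to respect ordering,
    partitioning, etc., which makes outliers even harder to avoid."""
    files, current = [], 0
    for size in chunk_sizes:
        current += size
        if current >= target:
            files.append(current)
            current = 0
    if current:
        files.append(current)  # leftover "cup of sand" file
    return files

random.seed(42)
# Hypothetical chunks of 1-8 MB each; target file size of 32 MiB.
chunks = [random.randint(1_000_000, 8_000_000) for _ in range(200)]
target = 33_554_432

files = pack_chunks(chunks, target)
print(f"{len(files)} files, min {min(files):,}, max {max(files):,}, target {target:,}")
```

Every closed file overshoots the target by up to one chunk, and the final file can be far smaller, so min and max naturally straddle the target even in this idealised setting.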

I hope this analogy helped 🙂
