topic Re: Auto tuning of file size in Data Engineering

Auto tuning of file size

pooja_bhumandla — Wed, 18 Jun 2025 17:01:03 GMT

Why are maxFileSize and minFileSize different from targetFileSize after optimization? What is the significance of targetFileSize?

"numRemovedFiles": "2099",
"numRemovedBytes": "29658974681",
"p25FileSize": "29701688",
"numDeletionVectorsRemoved": "0",
"minFileSize": "19920357",
"numAddedFiles": "883",
"maxFileSize": "43475356",
"p75FileSize": "34394580",
"p50FileSize": "31978037",
"numAddedBytes": "28254074450"

targetFileSize: "33554432". Explain using the above information.

Re: Auto tuning of file size

loui_wentzel — Wed, 18 Jun 2025 19:49:07 GMT

Depending on your table, when running optimize or similar, it can be more compute-efficient to allow a few size outliers that are bigger or smaller than the target, rather than forcing every file to one set value. The metrics simply tell you that the smallest and biggest files written are 19920357 and 43475356 bytes respectively, and the target was 33554432 bytes.
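To put the numbers from the question side by side, here is a small sketch in plain Python. It only does arithmetic on the metrics quoted above and assumes all sizes are in bytes; the variable names are just for illustration.

```python
# Metrics copied from the optimize output quoted in the question (sizes in bytes).
metrics = {
    "numRemovedFiles": 2099,
    "numRemovedBytes": 29658974681,
    "numAddedFiles": 883,
    "numAddedBytes": 28254074450,
    "minFileSize": 19920357,
    "p25FileSize": 29701688,
    "p50FileSize": 31978037,
    "p75FileSize": 34394580,
    "maxFileSize": 43475356,
}
target = 33554432  # targetFileSize from the question, i.e. 32 MiB

MIB = 1024 * 1024

# Average file size before and after compaction.
avg_before = metrics["numRemovedBytes"] / metrics["numRemovedFiles"]
avg_after = metrics["numAddedBytes"] / metrics["numAddedFiles"]
print(f"average before: {avg_before / MIB:.1f} MiB")
print(f"average after:  {avg_after / MIB:.1f} MiB")

# How far each reported size is from the target.
for key in ("minFileSize", "p25FileSize", "p50FileSize", "p75FileSize", "maxFileSize"):
    ratio = metrics[key] / target
    print(f"{key}: {metrics[key] / MIB:.1f} MiB ({ratio:.0%} of target)")
```

The average rewritten file comes out at about 30.5 MiB, roughly 95% of the 32 MiB target, and the median is close to that too. The min (about 59% of target) and max (about 130% of target) are the outliers at either end.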

Re: Auto tuning of file size

pooja_bhumandla — Thu, 19 Jun 2025 06:25:59 GMT

Why is a file size bigger or smaller than the target a better solution? What are the reasons for it?

Re: Auto tuning of file size

loui_wentzel — Thu, 19 Jun 2025 06:46:47 GMT

There could be several different reasons, but mainly it's because grouping arbitrary data into some target file size is, well... arbitrary.

Imagine I gave you a large container of sand and some empty buckets, and asked you to move the sand from the container into the buckets, aiming for half-full buckets. As you fill your buckets, you realise you have three half-full buckets and only a small portion left over. Do you redistribute everything to get maybe 2/5 in each of four buckets? Or 3/5 in each of three? Or do you leave the fourth with only a cup in it? Or two half-full and a third a bit over target?

Now this is a very simplistic example, but imagine there are different types of sand, colours, grain sizes, etc., and I asked you to redistribute the types of sand in a very specific way depending on their properties. This is no longer a simple task, even for three buckets.

This is basically what file-size optimization (if you include partitioning, optimize, etc.) does. It redistributes everything neatly into buckets, but it is impossible to get the same size in each bucket, because the contents don't divide as neatly as you ask. That's why it's a target: it's what the machine aims for, but there will always be outliers.
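The bucket problem can be sketched in a few lines of Python. This is a toy illustration only: `pack_towards_target` is a hypothetical helper, not how optimize actually groups data, and the chunk sizes are made-up numbers in MiB.

```python
def pack_towards_target(sizes, target):
    """Toy greedy packer: fill a bucket until the next chunk would move it further from the target."""
    buckets = []
    current = 0
    for size in sizes:
        # Close the current bucket if adding this chunk would overshoot
        # by more than the bucket currently undershoots.
        if current > 0 and abs(current + size - target) > abs(current - target):
            buckets.append(current)
            current = 0
        current += size
    if current > 0:
        buckets.append(current)  # whatever is left over becomes the last bucket
    return buckets


# Hypothetical chunk sizes in MiB; the target is 32 MiB as in the question.
chunks = [7, 12, 5, 20, 9, 14, 3, 18, 11, 6]
buckets = pack_towards_target(chunks, target=32)
print(buckets)  # [24, 29, 35, 17]
```

No bucket lands exactly on 32: some end up under, one ends up over, and the leftover is well below the target, even though every decision was made to get as close to the target as possible.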

I hope this analogy helped 🙂