Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Auto tuning of file size

pooja_bhumandla
New Contributor II

Why are maxFileSize and minFileSize different from targetFileSize after optimization? What is the significance of targetFileSize?

{
  "numRemovedFiles": "2099",
  "numRemovedBytes": "29658974681",
  "p25FileSize": "29701688",
  "numDeletionVectorsRemoved": "0",
  "minFileSize": "19920357",
  "numAddedFiles": "883",
  "maxFileSize": "43475356",
  "p75FileSize": "34394580",
  "p50FileSize": "31978037",
  "numAddedBytes": "28254074450"
}

The table's targetFileSize is "33554432". Please explain using the above information.
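A quick sanity check with plain Python arithmetic on the metrics above (just back-of-the-envelope math, not a Databricks API) shows that the average added file size lands close to the target even though individual files spread around it:

```python
# OPTIMIZE metrics copied from the post above.
num_added_files = 883
num_added_bytes = 28_254_074_450
target_file_size = 33_554_432  # 32 MiB

avg_added = num_added_bytes / num_added_files
print(f"average added file size: {avg_added:,.0f} bytes")
print(f"avg / target ratio:      {avg_added / target_file_size:.2f}")  # about 0.95
```

So the optimizer hit the target on average; minFileSize and maxFileSize just describe the spread around it.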

3 REPLIES

loui_wentzel
New Contributor III

Depending on your table, when running OPTIMIZE or similar it can be more compute-efficient to allow a few size outliers that are bigger or smaller than the target, rather than hard-setting every file to one exact value. The metrics are simply telling you that the smallest and biggest files are 19920357 and 43475356 bytes respectively, while the target was 33554432.

 

pooja_bhumandla
New Contributor II

Why is allowing file sizes bigger or smaller than the target a better solution? What are the reasons for it?

loui_wentzel
New Contributor III

There could be several different reasons, but mainly it's because grouping arbitrary data into some target file size is, well... arbitrary.

Imagine I gave you a large container of sand and some empty buckets, and asked you to move the sand from the container to the buckets, aiming for half-full buckets. As you fill your buckets you realise you have 3 half-full buckets and only a small portion of sand left. Do you redistribute everything to get to maybe 2/5 in four buckets? Or 3/5 in three? Or do you leave the fourth bucket with only a cup in it? Or keep 2 half-full and a third a bit over target?

Now this is a very simplistic example, but imagine now types of sand, colours, grain sizes, etc., and I asked you to redistribute the sand in a very specific way depending on its properties. This is no longer a simple task, even for three buckets.

This is basically what file-size optimization (if you include partitioning, OPTIMIZE, etc.) does. It redistributes everything neatly into buckets, but it is impossible to get the same size in each bucket, because the contents don't divide neatly the way you ask them to. That's why it's a target: it's what the machine aims for, but there will always be outliers.
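The bucket analogy can be sketched as a toy greedy packer in plain Python (made-up illustration code, not Delta's actual bin-packing algorithm; the chunk sizes are random, hypothetical values):

```python
import random

def pack_chunks(chunk_sizes, target):
    """Greedily pack chunks into files, closing a file once it reaches
    the target size. Real OPTIMIZE also has to respect ordering,
    partitioning, etc., which makes outliers even harder to avoid."""
    files, current = [], 0
    for size in chunk_sizes:
        current += size
        if current >= target:
            files.append(current)
            current = 0
    if current:
        files.append(current)  # leftover "cup of sand" file
    return files

random.seed(42)
# Hypothetical chunks of 1-8 MB each; target file size of 32 MiB.
chunks = [random.randint(1_000_000, 8_000_000) for _ in range(200)]
target = 33_554_432

files = pack_chunks(chunks, target)
print(f"{len(files)} files, min {min(files):,}, max {max(files):,}, target {target:,}")
```

Every closed file overshoots the target by up to one chunk, and the final file can be far smaller, so min and max naturally straddle the target even in this idealised setting.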

I hope this analogy helped 🙂
