Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

data file size

pooja_bhumandla
New Contributor II

"numRemovedFiles": "2099",
"numRemovedBytes": "29658974681",
"p25FileSize": "29701688",
"numDeletionVectorsRemoved": "0",
"minFileSize": "19920357",
"numAddedFiles": "883",
"maxFileSize": "43475356",
"p75FileSize": "34394580",
"p50FileSize": "31978037",
"numAddedBytes": "28254074450"

targetFileSize: "33554432"

I got the metrics above after running OPTIMIZE. Why is maxFileSize greater than targetFileSize, and why is minFileSize less than targetFileSize? Does targetFileSize have any significance? If it does, why do the max and min file sizes differ from targetFileSize? What criteria determine maxFileSize and minFileSize?

3 REPLIES

ilir_nuredini
Valued Contributor

Hello Pooja

Target File Size (TFS) is a Delta Lake table property (delta.targetFileSize) that lets you specify the desired size of the data files in the root Delta Lake table directory. Delta Lake then tries to write files of approximately that size. So it definitely matters, but Delta Lake does not guarantee that every file produced by OPTIMIZE will be exactly targetFileSize. Instead, it aims to:

1. Avoid small files
2. Avoid splitting rows or complex data types mid-record
3. and so on

That's why you see variation in the min, max, and percentile stats.
The resulting maxFileSize and minFileSize depend on criteria including (but not limited to):

1. targetFileSize (as a guideline)
2. Partition Size & Skew
3. Row and schema characteristics
4. ...
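
To see how close the result actually is to the target, it can help to express the reported sizes as fractions of targetFileSize. The sketch below is plain arithmetic on the numbers posted in the question; the helper name `ratio_to_target` is made up for illustration and is not part of any Delta Lake API.

```python
# Values copied from the OPTIMIZE metrics posted in the question.
metrics = {
    "numRemovedFiles": 2099,
    "numRemovedBytes": 29658974681,
    "numAddedFiles": 883,
    "numAddedBytes": 28254074450,
    "minFileSize": 19920357,
    "p25FileSize": 29701688,
    "p50FileSize": 31978037,
    "p75FileSize": 34394580,
    "maxFileSize": 43475356,
}
target_file_size = 33554432  # 32 MiB, the targetFileSize from the question

MIB = 2 ** 20


def ratio_to_target(size_bytes, target=target_file_size):
    """Hypothetical helper: a file size as a fraction of the target size."""
    return size_bytes / target


# Average file size before and after OPTIMIZE.
mean_removed = metrics["numRemovedBytes"] / metrics["numRemovedFiles"]
mean_added = metrics["numAddedBytes"] / metrics["numAddedFiles"]

print(f"mean before: {mean_removed / MIB:.1f} MiB")
print(f"mean after:  {mean_added / MIB:.1f} MiB "
      f"({ratio_to_target(mean_added):.2f}x target)")

# Each reported statistic relative to the target.
for name in ("minFileSize", "p25FileSize", "p50FileSize",
             "p75FileSize", "maxFileSize"):
    size = metrics[name]
    print(f"{name}: {size / MIB:.1f} MiB ({ratio_to_target(size):.2f}x target)")
```

On these numbers the average file grows from roughly 13.5 MiB to roughly 30.5 MiB, and the median lands at about 95% of the 32 MiB target, so the target is clearly steering the result even though individual files fall on either side of it.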

Best, Ilir

pooja_bhumandla
New Contributor II

What are the criteria that cause the max and min file sizes to vary from the target file size?

The criteria that can cause the max and min sizes to vary from the target file size are:

1. Partition Size & Data Skew
2. Row Size and Schema Complexity
3. Cost-Based Optimization Heuristics
4. ...
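
As a rough illustration of the first point only, here is a toy model of how partition size alone can produce files below the target. It is not Delta Lake's actual algorithm: the function `split_partition` and the partition sizes are invented for the example, and it assumes each partition is simply split into the fewest equal-sized files that stay at or under the target.

```python
import math

MIB = 2 ** 20
TARGET = 32 * MIB  # same 32 MiB target as in the question


def split_partition(partition_bytes, target_bytes):
    """Toy model (an assumption, not Delta's logic): split one partition
    into the fewest near-equal files that do not exceed the target."""
    n_files = max(1, math.ceil(partition_bytes / target_bytes))
    base, extra = divmod(partition_bytes, n_files)
    # Spread any leftover bytes one at a time over the first `extra` files.
    return [base + 1 if i < extra else base for i in range(n_files)]


# Hypothetical partition sizes, chosen only to show the effect.
for partition_mib in (20, 100, 64):
    files = split_partition(partition_mib * MIB, TARGET)
    sizes = ", ".join(f"{f / MIB:.1f}" for f in files)
    print(f"{partition_mib} MiB partition -> {len(files)} file(s): {sizes} MiB")
```

In this model a 20 MiB partition can only ever yield one 20 MiB file, and a 100 MiB partition yields four 25 MiB files, so skewed or small partitions pull the minimum below the target. It does not explain files above the target; those would come from other factors, such as the row and schema characteristics mentioned above.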
