data file size
06-18-2025 07:35 AM
"numRemovedFiles": "2099",
"numRemovedBytes": "29658974681",
"p25FileSize": "29701688",
"numDeletionVectorsRemoved": "0",
"minFileSize": "19920357",
"numAddedFiles": "883",
"maxFileSize": "43475356",
"p75FileSize": "34394580",
"p50FileSize": "31978037",
"numAddedBytes": "28254074450"
targetFileSize: "33554432"
I have the above information after running OPTIMIZE. Why is maxFileSize greater than targetFileSize, and why is minFileSize less than targetFileSize? Does targetFileSize have any significance? If it does, why are the max and min file sizes not the same as targetFileSize? What criteria determine maxFileSize and minFileSize?
06-18-2025 08:19 AM
Hello Pooja
Target File Size (TFS) is a Delta Lake table property (delta.targetFileSize) that lets you specify the desired size of the data files in the Delta Lake table directory. Delta Lake then writes the table's files to storage at approximately that size. So it definitely matters, but Delta Lake does not guarantee that every output file after OPTIMIZE will be exactly targetFileSize. Instead, it aims to:
1. Avoid small files
2. Avoid splitting rows or complex data types mid-record
3. and so on
That's why you see variation in the min and max, and in the percentile stats.
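To make that variation concrete, here is a small sketch that compares the figures from the question against the target. It only does arithmetic on the posted numbers; the variable and function names are my own, and nothing here calls Delta Lake.

```python
# Metrics copied from the OPTIMIZE output in the question (all values in bytes).
metrics = {
    "numRemovedFiles": 2099,
    "numRemovedBytes": 29658974681,
    "numAddedFiles": 883,
    "numAddedBytes": 28254074450,
    "minFileSize": 19920357,
    "p25FileSize": 29701688,
    "p50FileSize": 31978037,
    "p75FileSize": 34394580,
    "maxFileSize": 43475356,
}
target_file_size = 33554432  # the targetFileSize from the question (32 MiB)

MIB = 1024 * 1024


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / MIB


# Average file size before and after compaction.
avg_before = metrics["numRemovedBytes"] / metrics["numRemovedFiles"]
avg_after = metrics["numAddedBytes"] / metrics["numAddedFiles"]

print(f"target:     {to_mib(target_file_size):.1f} MiB")
print(f"avg before: {to_mib(avg_before):.1f} MiB")
print(f"avg after:  {to_mib(avg_after):.1f} MiB")

# How far each reported statistic sits from the target.
for key in ("minFileSize", "p25FileSize", "p50FileSize", "p75FileSize", "maxFileSize"):
    ratio = metrics[key] / target_file_size
    print(f"{key}: {to_mib(metrics[key]):.1f} MiB ({ratio:.0%} of target)")
```

On these numbers the average file grows from roughly 13.5 MiB to roughly 30.5 MiB, and the median lands within about 5% of the 32 MiB target, so the target is steering the result even though the min and max sit on either side of it.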
The maxFileSize and minFileSize depend on criteria such as these (the list is not exhaustive):
1. targetFileSize (as a guideline)
2. Partition Size & Skew
3. Row and schema characteristics
4. ...
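If you want to change the target, the property can be set with a table-properties statement. The sketch below only builds that statement as a string; the table name is hypothetical, the helper function is my own, and running it assumes a Spark session named `spark` attached to a Delta-enabled environment.

```python
def set_target_file_size_sql(table_name, size_bytes):
    """Build an ALTER TABLE statement that sets delta.targetFileSize.

    table_name is a hypothetical placeholder; size_bytes is the target in bytes.
    """
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    return (
        f"ALTER TABLE {table_name} "
        f"SET TBLPROPERTIES ('delta.targetFileSize' = '{size_bytes}')"
    )


# Hypothetical table name, with the same 32 MiB target as in the question.
statement = set_target_file_size_sql("my_catalog.my_schema.events", 32 * 1024 * 1024)
print(statement)

# Assumption: a SparkSession called `spark` exists in your environment.
# spark.sql(statement)
# spark.sql("OPTIMIZE my_catalog.my_schema.events")
```

Even with the property set, the resulting files will only be approximately this size, for the reasons listed above.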
Best, Ilir
06-19-2025 12:33 AM
What are the criteria that cause the max and min file sizes to vary from the target file size?
06-19-2025 02:50 AM
The criteria that may cause the max and min sizes to vary from the target file size are:
1. Partition Size & Data Skew
2. Row Size and Schema Complexity
3. Cost-Based Optimization Heuristics
4. ...