cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

How to control file size by OPTIMIZE

Brad
Contributor II

Hi,

I have a delta table under UC, no partition, no liquid clustering. I tried

 

 

OPTIMIZE foo;
-- OR
ALTER TABLE foo SET TBLPROPERTIES(delta.targetFileSize = '128mb');
OPTIMIZE foo;

 

I expect to see the files can have some change after above, but the OPTIMIZE returns 0 filesAdded and 0 filesRemoved. By "DESCRIBE detail foo" I didn't see numFiles changed.
Am I missing something? How to make the file size as expected? Are there some conditions to trigger OPTIMIZE to control file size?

Thanks

 

3 REPLIES 3

filipniziol
Contributor III

Hi @Brad ,
Could you share the statistics of the table: how many files it has, what is the average file size?
1. Purpose of OPTIMIZE is to combine many small files. Does you table contain many small files?
2. OPTIMIZE is idempotent - if run twice without any data changes in the table, the second OPTIMIZE won't do anything.

In general, based on this articlethe delta.targetFileSize setting acts as a guideline or target for the desired file size, but the actual file sizes can vary based on several factors, including the current size of the table, the nature of the data, and the specific operations being performed:

filipniziol_0-1727980267268.png

 

 

I'm doing some testing so I create some table foo first. And then generate some test data src to do upsert by MERGE. DESC detail foo can see numFiles e.g. 3 when I play around (by MERGE or INSERT), but the files on S3 has more. And I do OPTIMIZE after MERGE or INERT, and not see any changes for files. The parquet files on S3 is from 33KB to 509KB. I expect it can be merged as one file after OPTIMIZE. 

filipniziol
Contributor III

Hi @Brad ,

Databricks is a big data processing engine. Instead of testing 3 files try to test 3000 files 🙂 
OPTIMIZE isn't merging your small files because there may not be enough files or data for it to act upon.

Regarding why DESC DETAILS shows 3 files vs. what is in the folder: numFlies shows the active files, the files that are currently part of the latest snapshot of your delta table.

When you you perform write operations like INSERT, UPDATE, DELETE or MERGE, the old files are not deleted from storage, instead they are marked as deleted in Delta transaction log. They are kept to support features like time travel (you can check what was the version of the table as of yesterday)

 

Hope it makes sense.

 

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group