Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Small Files Persist After OPTIMIZE with Target File Size Set to 100MB – Seeking Possible Reasons

pooja_bhumandla
New Contributor II

I'm currently working on optimizing a Delta table in Databricks. As part of this, I've increased the target file size from the default (~33MB) to 100MB and run the OPTIMIZE command. However, after the OPTIMIZE operation completes, I still observe a large number of small files (e.g., 5KB, 10KB, 100KB, 3MB) within certain partitions.

I'm trying to understand the possible reasons why these small files are not being merged into larger files, despite the new target file size. Specifically:

Why are small files still present after optimization?
What conditions or limitations might prevent these files from being compacted into larger ones?
Shouldn't small partitions at least be compacted into a single file, even if they don't reach the target size?

I would appreciate any insights or clarifications.

Thanks in advance for your support!
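For context, the setup follows the standard pattern of persisting delta.targetFileSize as a table property and then running OPTIMIZE. A minimal sketch of those commands (the table name is a placeholder; each statement would be executed via spark.sql(...) in a notebook):

```python
def build_optimize_commands(table: str, target_size_bytes: int) -> list[str]:
    """Build the SQL statements for retargeting Delta file size and compacting.

    The table name is a placeholder; run each returned statement with
    spark.sql(...) in a Databricks notebook.
    """
    return [
        # Persist the new target file size as a table property
        # (a hint to the compactor, not a hard limit).
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('delta.targetFileSize' = '{target_size_bytes}')",
        # Rewrite small files toward the target size.
        f"OPTIMIZE {table}",
    ]

commands = build_optimize_commands("my_catalog.my_schema.events",
                                   100 * 1024 * 1024)  # 100MB
for sql in commands:
    print(sql)
```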

2 REPLIES

ilir_nuredini
Valued Contributor

Hello @pooja_bhumandla ,

Even after setting delta.targetFileSize to 100MB, it’s normal to still see smaller files after OPTIMIZE. That setting is only a guideline, not a hard rule. Delta makes a best effort but won’t force all files to match the exact size.

Small files may remain, for example, due to:

1. Small partitions that don’t have enough data to merge further.

2. Delta avoiding row splits or large shuffle costs during compaction.

3. ZORDER (if used), which keeps files aligned to data layout for faster queries.

So, Delta prioritizes correctness and performance over strict file size, and this is expected behavior.
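To make point 1 concrete, here is a rough back-of-the-envelope sketch (plain Python, sizes in bytes, not any Delta internal) of why a small partition can never reach the target: even a perfect compaction cannot produce fewer than ceil(total / target) files, so a partition holding only a few megabytes always ends up as one file well under 100MB.

```python
def best_case_file_count(file_sizes, target_size):
    """Lower bound on files after compaction: ceil(total bytes / target), min 1."""
    total = sum(file_sizes)
    # -(-a // b) is ceiling division; a partition with < target_size bytes
    # can at best become a single file that is still below the target.
    return max(1, -(-total // target_size))

TARGET = 100 * 1024 * 1024  # 100MB
tiny_partition = [5_000, 10_000, 100_000, 3_000_000]  # ~3MB total
print(best_case_file_count(tiny_partition, TARGET))   # 1 file, still far under 100MB
```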

You can read more in the relevant docs here:

https://docs.databricks.com/aws/en/delta/tune-file-size

https://docs.delta.io/latest/optimizations-oss.html

Hope that helps.

Best,
Ilir

Brahmareddy
Honored Contributor III

Hi pooja_bhumandla,

Great question! How are you doing today? Even after running the OPTIMIZE command with a higher target file size like 100MB, it's common to still see some small files in your Delta table, especially in partitions with very little data. This happens because Databricks only compacts files when doing so actually improves performance. For example, if a partition contains just a few megabytes in total, it may already be efficient and won't be merged further just to hit the target size.

A few other factors can leave small files behind:

1. Files created recently, or still being written to (e.g., by a streaming job), may be skipped by OPTIMIZE.

2. Certain partitions won't be touched if a WHERE clause limited the OPTIMIZE run.

3. Some small files may be kept if they contain special data, such as change data feed metadata or distinct schema versions.

If small files are spread across many partitions and hurting performance, consider automating OPTIMIZE for only the latest active partitions (like the last 7 days), which balances performance and cost better. Let me know if you'd like a sample maintenance script.
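Since a maintenance script came up, here is a minimal sketch of that idea, assuming a table partitioned by a date column; the table and column names (my_schema.events, event_date) are placeholders, and the generated statement would be run with spark.sql(...) on a daily schedule:

```python
from datetime import date, timedelta

def optimize_recent_partitions(table: str, partition_col: str,
                               today: date, days: int = 7) -> str:
    """Build an OPTIMIZE statement limited to the last `days` partitions.

    Restricting OPTIMIZE with a WHERE clause on the partition column keeps
    the run cheap by skipping cold partitions that no longer accumulate
    small files.
    """
    cutoff = today - timedelta(days=days)
    return (f"OPTIMIZE {table} "
            f"WHERE {partition_col} >= '{cutoff.isoformat()}'")

sql = optimize_recent_partitions("my_schema.events", "event_date",
                                 date(2024, 6, 15))
print(sql)
# OPTIMIZE my_schema.events WHERE event_date >= '2024-06-08'
```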

Regards,

Brahma
