Hello all,
I'm in the process of optimizing my tables and have run into a confusing situation. I have a table named "trace_messages_fg_streaming_event". If I navigate to it in the Databricks catalog, it shows the following stats:
Size: 6.7GB, Files: 464
But when I look at our actual S3 location:
File count: 2,544,762
Total size: 789,695,631,757 bytes (735.69 GB)
The file count above is gathered by a script that parses the output of an aws s3 ls call against the root folder of the Delta table. The call is not recursive, so it only counts files at the root, not anything under the _delta_log folder, etc. We chose not to partition this table: given the size of our data, Z-ordering the whole table nightly seemed like a better way to keep files consolidated (and thus keep query performance up).
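For reference, the count is gathered with something roughly like the following (a minimal sketch, not our exact script; it uses boto3 instead of the aws CLI, and the bucket/prefix names are placeholders):

import boto3

# Count objects and total bytes directly under the table's root prefix.
# Delimiter="/" makes the listing non-recursive, so _delta_log/ and other
# subfolders are excluded, matching the non-recursive aws s3 ls behavior.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

file_count = 0
total_bytes = 0
for page in paginator.paginate(
        Bucket="my-bucket",                                 # placeholder
        Prefix="delta/trace_messages_fg_streaming_event/",  # placeholder
        Delimiter="/"):
    for obj in page.get("Contents", []):
        file_count += 1
        total_bytes += obj["Size"]

print(file_count, total_bytes)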
Table config:
delta.autoOptimize.autoCompact=false
delta.autoOptimize.optimizeWrite=true
delta.minReaderVersion=1
delta.minWriterVersion=6
delta.targetFileSize=1073741824
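For completeness, this is roughly how the properties are set and checked from PySpark (a sketch; the bare table name assumes the default catalog/schema):

# Rough sketch: set the table properties listed above.
spark.sql("""
  ALTER TABLE trace_messages_fg_streaming_event SET TBLPROPERTIES (
    'delta.autoOptimize.autoCompact'   = 'false',
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.targetFileSize'             = '1073741824'
  )
""")

# DESCRIBE DETAIL reports numFiles/sizeInBytes for the current table snapshot --
# presumably the same 6.7 GB / 464 files the catalog UI is showing.
spark.sql("DESCRIBE DETAIL trace_messages_fg_streaming_event") \
     .select("numFiles", "sizeInBytes", "properties") \
     .show(truncate=False)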
This table is constantly written to by a write stream, so I understand that smaller files may be produced as part of the streaming writes, but those should be cleaned up by the nightly OPTIMIZE. The small files we're seeing, however, are spread out over the past year. Example screenshot:
Z-ordering is based on day for these logs. We've tried several approaches, including OPTIMIZE without Z-ordering, and we still end up with many, many small files. The Z-order also doesn't take very long to run (a couple of minutes), which is suspicious to me.
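For concreteness, the nightly job boils down to something like this (a simplified sketch; 'log_date' is a stand-in for the actual day column we Z-order on):

# Nightly compaction job (sketch).
spark.sql("OPTIMIZE trace_messages_fg_streaming_event ZORDER BY (log_date)")

# Variant we've also tried: plain compaction without Z-ordering.
spark.sql("OPTIMIZE trace_messages_fg_streaming_event")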
Any insight/help on what we're seeing would be greatly appreciated!
Thanks, Brad