Hi all - quick update, I've rebuilt the table, and am using partitioning now by year and quarter (just a side note really), and after the initial rebuild (using structured streaming loading millions of small json files), the file sizes looked great. But I'm still having problems, and it comes down to the structured streaming writes I have.
2023/Q2 (no file writes as it's too far back, looks great):
2023/Q3 (A few writes since we're near the boundary, you see the small files):
2023/Q4 (majority of writes here, tons of small files):
So then I run an optimize on 2023/Q4 partition and get:
{
"numFilesAdded": 0,
"numFilesRemoved": 0,
"filesAdded": {
"min": null,
"max": null,
"avg": 0,
"totalFiles": 0,
"totalSize": 0
},
"filesRemoved": {
"min": null,
"max": null,
"avg": 0,
"totalFiles": 0,
"totalSize": 0
},
"partitionsOptimized": 1,
"zOrderStats": null,
"numBatches": 0,
"totalConsideredFiles": 2,
"totalFilesSkipped": 2,
"preserveInsertionOrder": true,
"numFilesSkippedToReduceWriteAmplification": 0,
"numBytesSkippedToReduceWriteAmplification": 0,
"startTimeMs": 1698234957310,
"endTimeMs": 1698234959099,
"totalClusterParallelism": 12,
"totalScheduledTasks": 0,
"autoCompactParallelismStats": null,
"deletionVectorStats": {
"numDeletionVectorsRemoved": 0,
"numDeletionVectorRowsRemoved": 0
},
"numTableColumns": 60,
"numTableColumnsWithStats": 32,
"totalTaskExecutionTimeMs": 0,
"skippedArchivedFiles": 0,
"clusteringMetrics": null
}
The files we are reading are json files that correlate to one row in the table and we're processing about 150k files/rows a day into this table, using a standard readstream + writestream in pyspark. This is a delta table, and also has autocompact=true and optimizewrites=true.
I'm at a loss why these files are not getting consolidated, and would love a solution that would keep them optimized in file sizes around 250-300MB. Over time this will become a big performance problem, and I'd rather not have to rebuild my entire table periodically. I'm hoping I'm just missing something key here. Also using clusters with the latest (as of today) 13.3 LTS runtime.
Any insight/help into diagnosing this issue would be greatly appreciated!
Regards,
Brad