Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Autoloader event vs directory ingestion

AndriusVitkausk
New Contributor III

For a production workload containing around 15k gzip-compressed JSON files per hour, all landing under a YYYY/MM/DD/HH/id/timestamp.json.gz directory structure:

What would be the better approach for ingesting this into a Delta table, in terms of not only the incremental loads but also reprocessing?

I've so far tried both the directory listing and event notification methods through Auto Loader. Event notifications do seem quicker on incremental loads, though I'm not sure about them not guaranteeing a 100% delivery SLA (more on this below), but both are tragically slow at reprocessing this kind of workload.

With event notifications ingesting 15k files per hour, and daily runs accumulating 360k files, could some files be missed by the notifications? I've seen an option for backfilling the data at an interval on these notifications, but that falls back to listing the entire directory, so I'm not sure whether re-architecting how the files are dropped would help Auto Loader at all?
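
For reference, this is roughly what the two Auto Loader setups being compared look like; a minimal sketch with placeholder paths, checkpoint locations, and table names, and with the Azure service-principal options that notification mode requires left out:

```python
# Minimal Auto Loader sketch (placeholder paths and table names).
# On Databricks, `spark` is the notebook-provided SparkSession.
# cloudFiles.useNotifications toggles between directory listing (false)
# and event notification (true) mode; cloudFiles.backfillInterval makes
# Auto Loader periodically fall back to a directory listing to catch
# any events that the notification service missed.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")       # event notification mode
    .option("cloudFiles.backfillInterval", "1 day")      # periodic listing-based backfill
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/ingest/_schema")
    .load("/mnt/landing/*/*/*/*/*/*.json.gz")            # YYYY/MM/DD/HH/id/timestamp.json.gz
)

(
    df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/ingest")
    .trigger(availableNow=True)                          # pick up new files, then stop
    .table("bronze.raw_events")
)
```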

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @Andrius Vitkauskas, we haven't heard from you since the last response, and I was checking back to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, we will respond with more details and try to help.

AndriusVitkausk
New Contributor III

@Kaniz Fatma So I've not found a fix for the small-file problem using Auto Loader; it seems to struggle really badly with large directories. I had a cluster running for 8 hours stuck on the "listing directory" stage with no end in sight, and the cluster seemed completely idle too, with nothing useful in the logs, which may suggest there's a bug there?

So I tried an alternative approach suggested by one of the senior engineers in the company: merge the JSON files during the copy activity in Azure Data Factory, so the 15k JSON files are turned into a single JSON file, and this seems to be performing as expected on Databricks. The cluster is now in the red on both CPU and memory consumption while processing those huge JSON files. This should resolve the issue of doing regular backfills, as the directory size (and, I'm assuming, the metadata) will be far smaller, and therefore faster.
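
For completeness, once the Data Factory copy activity has merged the hourly drop, the Databricks side can stay on plain directory listing, since there is now very little to list; again just a sketch with placeholder paths and names:

```python
# Directory-listing Auto Loader over the merged output (placeholder paths).
# With one merged file per run instead of ~15k small files, the listing
# step and the file-tracking metadata in the checkpoint stay small.
merged = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "false")      # directory listing mode
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/merged/_schema")
    .load("/mnt/landing-merged/")                        # output folder of the ADF copy activity
)

(
    merged.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/merged")
    .trigger(availableNow=True)
    .table("bronze.raw_events_merged")
)
```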
