Autoloader event vs directory ingestion

AndriusVitkausk
New Contributor III

For a production workload containing around 15k gzip-compressed JSON files per hour, all landing under a YYYY/MM/DD/HH/id/timestamp.json.gz directory structure:

What would be the better approach for ingesting this into a Delta table, in terms of not only the incremental loads but also reprocessing?

So far I've tried both the directory listing and event notification methods through Auto Loader. Event notifications do seem quicker on incremental loads, though I'm not sure about them not guaranteeing a 100% delivery SLA (more on this later), but both are tragically slow at reprocessing this kind of workload.

With event notifications ingesting 15k files per hour, and daily runs accumulating 360k files, could some files be missed by the notifications? I've seen an option for backfilling the data at an interval on these notifications, but that falls back to listing the entire directory, so I'm not sure whether re-architecting how the files are dropped would help Auto Loader at all?
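For reference, a minimal sketch of what an Auto Loader stream in file notification mode with a periodic listing backfill could look like; the storage paths and table name are placeholders, and the extra Azure queue/credential options that notification mode needs are omitted, so this is an assumption-laden outline rather than the exact pipeline from this thread:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical locations, used only for illustration
source_path = "abfss://landing@storageaccount.dfs.core.windows.net/events/"
checkpoint_path = "abfss://lake@storageaccount.dfs.core.windows.net/_checkpoints/events/"

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")             # .json.gz files are decompressed automatically
    .option("cloudFiles.useNotifications", "true")   # event notifications instead of directory listing
    .option("cloudFiles.backfillInterval", "1 day")  # periodic listing to catch files the notifications missed
    .load(source_path)
)

(
    df.writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)                      # process what's available, then stop (scheduled daily run)
    .toTable("bronze.events")
)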

2 REPLIES

Kaniz
Community Manager

Hi @Andrius Vitkauskas, we haven't heard from you since the last response and I was checking back to see if you have a resolution yet. If you have found a solution, please share it with the community, as it can be helpful to others. Otherwise, we will respond with more details and try to help.

AndriusVitkausk
New Contributor III

@Kaniz Fatma So I've not found a fix for the small-file problem using Auto Loader. It seems to struggle really badly against large directories: I had a cluster running for 8 hours stuck on the "listing directory" phase with no end in sight, the cluster seemed completely idle, and there was nothing useful in the logs, which may suggest there's a bug there?

So I tried an alternative approach suggested by one of the senior engineers in the company: merging the JSON files during the copy activity in Azure Data Factory. The 15k JSON files turned into a single JSON file, and this seems to be performing as expected on Databricks; the cluster is in the red on both CPU and memory consumption while processing those huge JSON files. This should resolve the issue of doing regular backfills, since the directory size (and, I'm assuming, the metadata) will be far smaller, and therefore faster.
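For completeness, after the ADF merge step the same stream could fall back to Auto Loader's default directory-listing mode, since only a handful of merged files land per day; again, the paths and table name below are hypothetical placeholders, not the actual pipeline:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical locations for the merged output, used only for illustration
merged_path = "abfss://landing@storageaccount.dfs.core.windows.net/events-merged/"
merged_checkpoint = "abfss://lake@storageaccount.dfs.core.windows.net/_checkpoints/events-merged/"

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")  # default directory-listing mode; ~24 merged files/day is cheap to list
    .load(merged_path)
)

(
    df.writeStream
    .option("checkpointLocation", merged_checkpoint)
    .trigger(availableNow=True)
    .toTable("bronze.events_merged")
)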
