Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Autoloader event vs directory ingestion

AndriusVitkausk
New Contributor III

For a production workload containing around 15k gzip-compressed JSON files per hour, all landing under a YYYY/MM/DD/HH/id/timestamp.json.gz directory structure:

What would be the better approach for ingesting this into a Delta table, in terms of not only the incremental loads but also reprocessing?

I've so far tried both the directory listing and event notification methods through Auto Loader. Event notifications do seem quicker on incremental loads, though I'm not sure about them not guaranteeing a 100% delivery SLA (more on this below), but both are tragically slow at reprocessing this kind of workload.

With event notifications ingesting 15k files per hour, and daily runs accumulating 360k files, could some files be missed by the notifications? I've seen an option for backfilling the data at an interval on these notifications, but that falls back to listing the entire directory, so I'm not sure whether re-architecting how the files are dropped would help Auto Loader at all?
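
For reference, this is roughly what the two Auto Loader setups being compared look like; a minimal sketch with placeholder paths, checkpoint locations, and table names, and with the Azure service-principal options that notification mode requires left out:

```python
# Minimal Auto Loader sketch (placeholder paths and table names).
# On Databricks, `spark` is the notebook-provided SparkSession.
# cloudFiles.useNotifications toggles between directory listing (false)
# and event notification (true) mode; cloudFiles.backfillInterval makes
# Auto Loader periodically fall back to a directory listing to catch
# any events that the notification service missed.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")       # event notification mode
    .option("cloudFiles.backfillInterval", "1 day")      # periodic listing-based backfill
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/ingest/_schema")
    .load("/mnt/landing/*/*/*/*/*/*.json.gz")            # YYYY/MM/DD/HH/id/timestamp.json.gz
)

(
    df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/ingest")
    .trigger(availableNow=True)                          # pick up new files, then stop
    .table("bronze.raw_events")
)
```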

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @Andrius Vitkauskas, we haven't heard from you since the last response, and I was checking back to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, we will respond with more details and try to help.

AndriusVitkausk
New Contributor III

@Kaniz Fatma So I've not found a fix for the small-file problem using Auto Loader; it seems to struggle really badly with large directories. I had a cluster running for 8 hours stuck on the "listing directory" stage with no end in sight, and the cluster seemed completely idle too, with nothing useful in the logs, which may suggest there's a bug there?

So I tried an alternative approach suggested by one of the senior engineers in the company: merge the JSON files during the copy activity in Azure Data Factory, so the 15k JSON files are turned into a single JSON file, and this seems to be performing as expected on Databricks. The cluster is now in the red on both CPU and memory consumption while processing those huge JSON files. This should resolve the issue of doing regular backfills, as the directory size (and, I'm assuming, the metadata) will be far smaller, and therefore faster.
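
For completeness, once the Data Factory copy activity has merged the hourly drop, the Databricks side can stay on plain directory listing, since there is now very little to list; again just a sketch with placeholder paths and names:

```python
# Directory-listing Auto Loader over the merged output (placeholder paths).
# With one merged file per run instead of ~15k small files, the listing
# step and the file-tracking metadata in the checkpoint stay small.
merged = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "false")      # directory listing mode
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/merged/_schema")
    .load("/mnt/landing-merged/")                        # output folder of the ADF copy activity
)

(
    merged.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/merged")
    .trigger(availableNow=True)
    .table("bronze.raw_events_merged")
)
```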
