Autoloader event vs directory ingestion

AndriusVitkausk
New Contributor III

For a production workload containing around 15k gzip-compressed JSON files per hour, all landing under a YYYY/MM/DD/HH/id/timestamp.json.gz directory structure:

What would be the better approach for ingesting this into a Delta table, in terms of not only the incremental loads but also reprocessing?

So far I've tried both the directory listing and event notification methods through Auto Loader. Event notifications do seem quicker on incremental loads, though I'm not sure about them not guaranteeing a 100% delivery SLA (more on this later), but both are tragically slow at reprocessing this kind of workload.

With event notifications ingesting 15k files per hour, and daily runs accumulating 360k files, could some files be missed by the notifications? I've seen an option for backfilling the data at an interval on these notifications, but that falls back to listing the entire directory, so I'm not sure whether re-architecting how the files are dropped would help Auto Loader at all?
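For reference, a minimal sketch of what an Auto Loader stream in file notification mode with a periodic listing backfill could look like; the storage paths and table name are placeholders, and the extra Azure queue/credential options that notification mode needs are omitted, so this is an assumption-laden outline rather than the exact pipeline from this thread:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical locations, used only for illustration
source_path = "abfss://landing@storageaccount.dfs.core.windows.net/events/"
checkpoint_path = "abfss://lake@storageaccount.dfs.core.windows.net/_checkpoints/events/"

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")             # .json.gz files are decompressed automatically
    .option("cloudFiles.useNotifications", "true")   # event notifications instead of directory listing
    .option("cloudFiles.backfillInterval", "1 day")  # periodic listing to catch files the notifications missed
    .load(source_path)
)

(
    df.writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)                      # process what's available, then stop (scheduled daily run)
    .toTable("bronze.events")
)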

2 REPLIES

Kaniz
Community Manager

Hi @Andrius Vitkauskas, we haven't heard from you since the last response and I was checking back to see if you have a resolution yet. If you have found a solution, please share it with the community, as it can be helpful to others. Otherwise, we will respond with more details and try to help.

AndriusVitkausk
New Contributor III

@Kaniz Fatma So I've not found a fix for the small-file problem using Auto Loader. It seems to struggle really badly against large directories: I had a cluster running for 8 hours stuck on the "listing directory" phase with no end in sight, the cluster seemed completely idle, and there was nothing useful in the logs, which may suggest there's a bug there?

So I tried an alternative approach suggested by one of the senior engineers in the company: merging the JSON files during the copy activity in Azure Data Factory. The 15k JSON files turned into a single JSON file, and this seems to be performing as expected on Databricks; the cluster is in the red on both CPU and memory consumption while processing those huge JSON files. This should resolve the issue of doing regular backfills, since the directory size (and, I'm assuming, the metadata) will be far smaller, and therefore faster.
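For completeness, after the ADF merge step the same stream could fall back to Auto Loader's default directory-listing mode, since only a handful of merged files land per day; again, the paths and table name below are hypothetical placeholders, not the actual pipeline:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical locations for the merged output, used only for illustration
merged_path = "abfss://landing@storageaccount.dfs.core.windows.net/events-merged/"
merged_checkpoint = "abfss://lake@storageaccount.dfs.core.windows.net/_checkpoints/events-merged/"

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")  # default directory-listing mode; ~24 merged files/day is cheap to list
    .load(merged_path)
)

(
    df.writeStream
    .option("checkpointLocation", merged_checkpoint)
    .trigger(availableNow=True)
    .toTable("bronze.events_merged")
)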
