Re: Autoloader event vs directory ingestion

AndriusVitkausk · 06-15-2022

@Kaniz Fatma So I've not found a fix for the small-file problem using Autoloader. It seems to struggle really badly against large directories: I had a cluster running for 8h stuck on the "listing directory" step with no end in sight. The cluster seemed completely idle too, with nothing useful in the logs, which may suggest there's a bug there?
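
For anyone landing here from the thread title: the "event" alternative to directory listing is presumably Auto Loader's file notification mode, which is switched on with the `cloudFiles.useNotifications` option. Below is a minimal sketch, not a tested fix for the hang described above. The helper names and paths are made up for illustration, and on Azure the notification mode also needs credential and subscription options that are left out here.

```python
def autoloader_notification_options(schema_location: str) -> dict:
    """Build Auto Loader options that use file notifications instead of directory listing.

    Hypothetical helper for illustration; only the option keys are real Auto Loader settings.
    """
    return {
        "cloudFiles.format": "json",
        # Discover new files from storage events rather than listing the directory.
        "cloudFiles.useNotifications": "true",
        # Where Auto Loader keeps the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
    }


def read_json_stream(spark, source_path: str, schema_location: str):
    """Return a streaming DataFrame over source_path (needs a Databricks Spark session)."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_notification_options(schema_location))
        .load(source_path)
    )
```

On a Databricks cluster this would be called as `read_json_stream(spark, "<source path>", "<schema path>")`. Whether it avoids the long listing step for an existing backlog of files is not something this post confirms.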

So I tried an alternate approach suggested by one of the senior engineers in the company: merging the JSON files during the copy activity in Azure Data Factory. 15k JSON files turned into a single JSON file, and this seems to be performing as expected on Databricks. The cluster is in the red on both CPU and memory consumption while processing those huge JSON files. This should resolve the issue of doing regular backfills, as the directory size (and, I'm assuming, the metadata) will be far smaller, and therefore faster.
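
To illustrate the merge idea outside of Data Factory, here is a rough Python sketch that combines many small JSON files into one JSON Lines file (one record per line), which Spark's JSON reader accepts by default. It is not what the copy activity does internally. The function name and the one-document-per-file layout are assumptions for the example.

```python
import json
from pathlib import Path


def merge_json_files(source_dir: str, target_file: str) -> int:
    """Merge every *.json file in source_dir into one JSON Lines file.

    Assumes each source file holds a single JSON document.
    Returns the number of files merged.
    """
    target = Path(target_file).resolve()
    # Collect the inputs first and skip the target itself in case it lives in source_dir.
    sources = [
        p for p in sorted(Path(source_dir).glob("*.json")) if p.resolve() != target
    ]
    with target.open("w", encoding="utf-8") as out:
        for path in sources:
            with path.open(encoding="utf-8") as f:
                record = json.load(f)
            # json.dumps without indent emits one line per record.
            out.write(json.dumps(record) + "\n")
    return len(sources)
```

With 15k inputs this leaves a single file to list and read, at the cost of each task handling much more data, which matches the high CPU and memory usage mentioned above.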