Auto Loader and source file structure optimisation

ilarsen — Tue, 14 Nov 2023 02:13:34 GMT

Hi. I have a question, and I've not been able to find an answer. I'm sure there is one...I just haven't found it through searching and browsing the docs.

How much does it matter (if it is indeed that simple) whether source files read by Auto Loader sit in a single folder or are organised into subfolders (e.g. YYYY/MM/DD)?

My environment is Azure Databricks and ADLS Gen2 (using hierarchical namespace). In this case, I have 4 "folders", each of which contains all the files we've ever received from various post API methods (1 folder for each method). It was not set up to create subfolders based on date, so each folder currently holds anywhere from under 1 million to over 5 million files, depending on the method.

I need to migrate this data, and what I'm really asking is: is it worth the effort of copying it into a date-based structure because that will make the Auto Loader part more efficient, or should I just dump it over as-is and carry on with life?
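
To make the date-based option concrete, here is a minimal sketch of how a flat path could be mapped to a YYYY/MM/DD path during the copy. It assumes the date comes from a timezone-aware timestamp such as the file's received or last-modified time, which is my assumption and not something the current setup is known to provide. The function name `dated_path` and the example folder and file names are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def dated_path(flat_path: str, received_at: datetime) -> str:
    """Map <folder>/<file> to <folder>/YYYY/MM/DD/<file>.

    received_at must be timezone-aware; the date is taken in UTC so that
    files land in the same subfolder regardless of where the copy runs.
    """
    src = PurePosixPath(flat_path)
    day = received_at.astimezone(timezone.utc)
    return str(src.parent / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}" / src.name)


# Hypothetical folder and file name, for illustration only.
example = dated_path(
    "method_a/payload_000123.json",
    datetime(2023, 11, 14, 2, 13, 34, tzinfo=timezone.utc),
)
print(example)
```

This only computes destination paths; the copy itself, and whether it is worth doing at all, is the open question.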

Re: Auto Loader and source file structure optimisation

ilarsen — Tue, 14 Nov 2023 22:47:02 GMT

Thanks for your response, that does help. From what I found (or didn't find, rather), it didn't seem to me like it would be a huge performance impact, either. A full-scale test would perhaps be the only way for me to learn for sure, but that may not be worth the effort. The flat file structure is historical now; a new process lands these files in a subfolder structure.

That said, I am still interested if someone else comes across this and can shed more light on the potential performance impact of flat versus hierarchical source folder structures on Auto Loader ingestion.
