Databricks Community

ilarsen · ‎11-13-2023

Hi. I have a question, and I've not been able to find an answer. I'm sure there is one...I just haven't found it through searching and browsing the docs.

How much does it matter (if it is indeed that simple) whether source files read by Auto Loader sit in a single folder or are organised into subfolders (e.g. YYYY/MM/DD)?

My environment is Azure Databricks and ADLS Gen2 (with hierarchical namespace enabled). In this case, I have 4 "folders", each containing all the files we've ever received from various POST API methods (1 folder per method). It was not set up to create subfolders based on date, so each folder currently holds anywhere from under 1 million to over 5 million files, depending on the method.

I need to migrate this data, and my question is: is it worth the effort of copying it into a date-based structure because that will make the Auto Loader part more efficient, or should I just dump it over as-is and carry on with life?
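To make the "copy to a date-based structure" option concrete, here is a minimal sketch of the path mapping it would involve. It assumes the date comes from each file's last-modified timestamp, which the post does not specify. The helper name `date_partitioned_path` and the example folder and file names are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def date_partitioned_path(flat_path: str, modified: datetime) -> str:
    """Map a file in a flat folder to <folder>/YYYY/MM/DD/<name>.

    Assumption: `modified` is the file's last-modified timestamp.
    """
    src = PurePosixPath(flat_path)
    # Normalise to UTC so files near midnight land in a consistent folder.
    day = modified.astimezone(timezone.utc)
    return str(src.parent / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}" / src.name)


# Hypothetical example: one file from one of the four method folders.
example = date_partitioned_path(
    "landing/method_a/payload_000123.json",
    datetime(2023, 11, 13, 8, 30, tzinfo=timezone.utc),
)
print(example)
```

This only computes the destination path. The copy or move itself, and whether it pays off for Auto Loader, is the open question here.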

ilarsen · ‎11-14-2023

Thanks for your response, that does help. From what I found (or didn't find, rather), it didn't seem to me that there would be a huge performance impact either. A full-scale test would perhaps be the only way for me to know for sure, but that may not be worth the effort. The flat file structure is historical now; a new process lands these files in a subfolder structure.

That said, I am still interested if someone else comes across this and can shed any more light on the potential performance impact of flat versus hierarchical source folder structures with Auto Loader ingestion.


Auto Loader and source file structure optimisation
