
Auto Loader and source file structure optimisation

ilarsen
Contributor

Hi.  I have a question, and I've not been able to find an answer.  I'm sure there is one...I just haven't found it through searching and browsing the docs.

 

How much does it matter (if it is indeed that simple) whether source files read by Auto Loader sit in a single folder or are organised into subfolders (e.g. YYYY/MM/DD)?

 

My environment is Azure Databricks and ADLS Gen2 (using hierarchical namespace).  In this case, I have 4 "folders", each containing all the files we've ever received from various POST API methods (1 folder for each method).  It was not set up to create subfolders based on date.  So each folder currently holds anywhere from under 1 million to over 5 million files, depending on the method.

 

I need to migrate this data, and that is where my question comes from: is it worth the effort of copying it into a date-based structure because that will make the Auto Loader part more efficient, or should I just dump it over as-is and carry on with life?

 

2 REPLIES

Kaniz
Community Manager

Hi @ilarsen, according to the Azure Databricks documentation, Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage. Auto Loader can load data files from Azure Data Lake Storage Gen2 (ADLS Gen2) with hierarchical namespace enabled. Auto Loader provides a Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory.
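
To make the cloudFiles source concrete, here is a minimal sketch of how such a stream might be configured from a Databricks notebook. The storage account, container, paths, and the JSON file format are assumptions for illustration and do not come from this thread; the option names (cloudFiles.format, cloudFiles.schemaLocation, cloudFiles.includeExistingFiles) are standard Auto Loader options.

```python
from typing import Any, Dict


def auto_loader_options(
    file_format: str,
    schema_location: str,
    include_existing_files: bool = True,
) -> Dict[str, str]:
    """Build the option map for a cloudFiles (Auto Loader) stream."""
    return {
        "cloudFiles.format": file_format,
        # Where Auto Loader keeps the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
        # Whether files already in the directory are processed as well.
        "cloudFiles.includeExistingFiles": str(include_existing_files).lower(),
    }


def read_with_auto_loader(spark: Any, input_path: str, options: Dict[str, str]):
    """Return a streaming DataFrame; this call needs a Databricks runtime."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(input_path)
    )


# Hypothetical storage account, container, and folder names.
options = auto_loader_options(
    file_format="json",  # assumption: the API payloads are JSON
    schema_location="abfss://landing@exampleaccount.dfs.core.windows.net/_schemas/method_a",
)
```

In a notebook, read_with_auto_loader(spark, "abfss://landing@exampleaccount.dfs.core.windows.net/method_a/", options) would return the stream, one per source folder.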

 

In terms of the folder structure, Auto Loader uses directory listing mode by default. In directory listing mode, Auto Loader identifies new files by listing the input directory. Directory listing mode allows you to quickly start Auto Loader streams without any permission configurations other than access to your data on cloud storage. Azure Databricks has optimized directory listing mode for Auto Loader to discover files in cloud sto...

 

Based on this information, it seems that the folder structure of your source files may not significantly affect the efficiency of Auto Loader. However, you may still want to organize your data into a date-based structure to make it easier to manage.
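
If you do decide to reorganize during the migration, the target path for each file could be derived with a small helper like the sketch below. The function name, folder name, and file name are hypothetical, and how you obtain each file's received date (from the file name, its modification time, or elsewhere) is an assumption left to you. The sketch only shows the path mapping; it says nothing about whether the new layout changes Auto Loader performance.

```python
from datetime import date
from pathlib import PurePosixPath


def dated_path(root: str, file_name: str, received: date) -> str:
    """Return root/YYYY/MM/DD/file_name for a file received on a given date."""
    return str(
        PurePosixPath(root)
        / f"{received:%Y}"  # four-digit year
        / f"{received:%m}"  # zero-padded month
        / f"{received:%d}"  # zero-padded day
        / file_name
    )


# Hypothetical folder and file names.
example = dated_path("method_a", "payload_0001.json", date(2023, 7, 4))
```

Zero-padding the month and day keeps the folders in chronological order when they are listed lexicographically.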

 

I hope this helps!

Thanks for your response, that does help.  From what I found - or didn't find, rather - it didn't seem to me like it would have a huge performance impact, either.  A full-scale test would perhaps be the only way for me to learn for sure, but that may not be worth the effort.  The flat file structure is historical now; a new process lands these files in a subfolder structure.

 

That said, I would still be interested to hear from anyone who comes across this and can shed more light on the potential performance impact of flat versus hierarchical source folder structures with Auto Loader ingestion.
