08-06-2025 08:35 AM
I'm curious to get thoughts and experience on this. Intuitively, directory listing mode makes sense to me as a way to ensure that only the latest unprocessed files are picked up and processed, but what is the cost impact as more and more files are added? Do the costs scale linearly?
Is there an easy/clear way to understand how the cost scales with the number of files in the directory? I found the following explanation, but I'm still confused about what kind of costs to expect with directory listing in ADLS:
08-15-2025 07:19 AM
Autoloader ingests your data incrementally regardless of whether you are on directory listing mode or file notification mode. The key difference lies in how it discovers new files. In directory listing mode, Autoloader queries the cloud storage API to list all files in the directory. At scale, this operation becomes more costly because the amount of metadata it needs to list and process grows with every new file added. The cost is tied to the number of I/O operations and the compute resources needed to handle this growing list, which can scale at a greater than linear rate, especially with deeply nested folder structures.
I advise looking at file notification mode, which works much better at scale. It avoids the expensive directory listing by using a pub/sub messaging service (like Azure Event Grid or AWS SQS) to get a notification whenever a new file arrives. This approach is more efficient because the cost scales with the number of new files, not the total number of files in the directory. In my experience, it provided a 10x performance uplift when streaming billions of small binary files, making it a more cost-effective solution.
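For reference, here is a minimal PySpark sketch of how the two modes are selected; the storage account, path, and format are placeholders I made up, not anything from your setup:

```python
# Directory listing mode (the default): Auto Loader lists the input path
# via the cloud storage API to discover new files.
listing_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("abfss://raw@mystorageaccount.dfs.core.windows.net/events/")
)

# File notification mode: Auto Loader subscribes to storage events
# (Event Grid plus a storage queue on Azure) instead of listing the directory.
notification_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .load("abfss://raw@mystorageaccount.dfs.core.windows.net/events/")
)
```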
08-15-2025 09:35 AM
@kerem thank you for your input. This is helpful, although I have a quick follow-up question. While I understand the scaling differences between directory listing mode and file notification mode, and the 10x performance difference is a good baseline example, is there some way to attribute costs directly to either of these modes? Without getting into complex directory structures, is there a rough way to calculate costs for a flat directory structure?
08-15-2025 10:12 AM - edited 08-15-2025 10:16 AM
Hi @ChristianRRL,
Both modes incur cost from DBU/h usage based on your compute configuration. Keeping all other variables the same, the only way I can think of to compare costs is to set up two streams that point to the same dataset with different modes and measure the duration of each stream.
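A rough sketch of that comparison; the source path, table names, and checkpoint locations are placeholders I made up:

```python
import time

# Hypothetical side-by-side comparison of the two discovery modes against
# the same source dataset.
source_path = "abfss://raw@mystorageaccount.dfs.core.windows.net/events/"

def run_once(label: str, use_notifications: bool) -> float:
    """Process the current backlog with one discovery mode; return wall-clock seconds."""
    start = time.time()
    query = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.useNotifications", str(use_notifications).lower())
        .load(source_path)
        .writeStream
        .option("checkpointLocation", f"/tmp/checkpoints/{label}")
        .trigger(availableNow=True)   # stop once the existing backlog is processed
        .toTable(f"ingest_test_{label}")
    )
    query.awaitTermination()
    return time.time() - start

durations = {
    "directory_listing": run_once("listing", use_notifications=False),
    "file_notification": run_once("notification", use_notifications=True),
}
print(durations)
```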
When calculating the total cost of your setup, remember to include the storage API operations (the listing calls themselves) and, for file notification mode, the messaging service charges, in addition to the DBU/h compute cost.
08-15-2025 10:45 AM - edited 08-15-2025 10:55 AM
Hi @ChristianRRL,
I guess you can more or less estimate the cost of directory listing mode. Below is an explanation of how directory listing works. You can then grab the Azure pricing calculator, check the price of a listing operation, and arrive at a rough estimate.
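As a back-of-the-envelope illustration, something like the sketch below; the file count, API page size, operation price, and trigger frequency are placeholder assumptions, so plug in the numbers from the Azure calculator for your region and storage tier:

```python
# Rough estimate of directory listing cost on ADLS (all numbers are assumptions).
total_files = 1_000_000          # files currently sitting in the monitored directory
files_per_list_page = 5_000      # assumed max results per List Blobs call
price_per_10k_list_ops = 0.05    # USD, assumed price per 10,000 list operations
triggers_per_day = 24 * 60       # e.g. one micro-batch per minute

list_calls_per_trigger = -(-total_files // files_per_list_page)  # ceiling division
daily_list_cost = (
    list_calls_per_trigger * triggers_per_day * price_per_10k_list_ops / 10_000
)
print(f"~{list_calls_per_trigger} list calls per trigger, "
      f"~${daily_list_cost:.2f}/day in listing operations")
```

The point is that the number of list calls per trigger grows with the total file count, so the daily listing cost keeps climbing even if the volume of new data stays flat.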
Tracking the cost of file notification mode should be much easier. Just apply tags to the resources created by Autoloader (e.g., the storage queue) and you can track the cost from there. This is the approach recommended by Databricks:
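If I remember the option name correctly, you can set those tags directly in the stream definition with cloudFiles.resourceTag.&lt;key&gt;; a sketch (the tag key/value and path are made up):

```python
# Sketch: tag the resources Auto Loader creates in file notification mode
# (Event Grid subscription, storage queue) so their cost shows up under the tag.
tagged_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.resourceTag.costCenter", "autoloader-ingest")  # hypothetical tag
    .load("abfss://raw@mystorageaccount.dfs.core.windows.net/events/")
)
```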