08-06-2025 08:35 AM
I'm curious to get thoughts and experience on this. Intuitively, directory listing mode makes sense to me as a way to ensure that only the latest unprocessed files are picked up and processed, but what is the cost impact as more and more files are added? Do the costs scale linearly?
Is there an easy/clear way to understand how the cost scales with the number of files in the directory? I found the following explanation, but I'm still confused about what kind of costs to expect with directory listing in ADLS:
08-15-2025 07:19 AM
Autoloader ingests your data incrementally regardless of whether you are on directory listing mode or file notification mode. The key difference lies in how it discovers new files. In directory listing mode, Autoloader queries the cloud storage API to list all files in the directory. At scale, this operation becomes more costly because the amount of metadata it needs to list and process grows with every new file added. The cost is tied to the number of I/O operations and the compute resources needed to handle this growing list, which can scale at a greater than linear rate, especially with deeply nested folder structures.
I advise looking at file notification mode, which works much better at scale. It avoids the expensive directory listing by using a pub/sub messaging service (like Azure Event Grid or AWS SQS) to get a notification whenever a new file arrives. This approach is more efficient because the cost scales with the number of new files, not the total number of files in the directory. In my experience, it provided a 10x performance uplift when streaming billions of small binary files, making it a more cost-effective solution.
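For reference, here is a minimal PySpark sketch of how the two modes are selected; the storage account, path, and format are placeholders I made up, not anything from your setup:

```python
# Directory listing mode (the default): Auto Loader lists the input path
# via the cloud storage API to discover new files.
listing_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("abfss://raw@mystorageaccount.dfs.core.windows.net/events/")
)

# File notification mode: Auto Loader subscribes to storage events
# (Event Grid plus a storage queue on Azure) instead of listing the directory.
notification_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .load("abfss://raw@mystorageaccount.dfs.core.windows.net/events/")
)
```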
08-15-2025 09:35 AM
@kerem thank you for your input. This is helpful, although I have a quick follow-up question. While I understand the scaling differences between directory listing mode and file notification mode, and the 10x performance difference is a good baseline example, is there some way to attribute costs directly to either of these modes? Without getting into complex directory structures, is there a rough way to calculate costs for a flat directory structure?
08-15-2025 10:12 AM - edited 08-15-2025 10:16 AM
Hi @ChristianRRL,
Both modes incur cost from DBU/h usage based on your compute configuration. Keeping all other variables the same, the only way I can think of to compare costs is to set up two streams that point to the same dataset with different modes and measure the duration of each stream.
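A rough sketch of that comparison; the source path, table names, and checkpoint locations are placeholders I made up:

```python
import time

# Hypothetical side-by-side comparison of the two discovery modes against
# the same source dataset.
source_path = "abfss://raw@mystorageaccount.dfs.core.windows.net/events/"

def run_once(label: str, use_notifications: bool) -> float:
    """Process the current backlog with one discovery mode; return wall-clock seconds."""
    start = time.time()
    query = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.useNotifications", str(use_notifications).lower())
        .load(source_path)
        .writeStream
        .option("checkpointLocation", f"/tmp/checkpoints/{label}")
        .trigger(availableNow=True)   # stop once the existing backlog is processed
        .toTable(f"ingest_test_{label}")
    )
    query.awaitTermination()
    return time.time() - start

durations = {
    "directory_listing": run_once("listing", use_notifications=False),
    "file_notification": run_once("notification", use_notifications=True),
}
print(durations)
```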
When calculating the total cost of your setup, remember to include the storage API operations (the listing calls themselves) and, for file notification mode, the messaging service charges, in addition to the DBU/h compute cost.
08-15-2025 10:45 AM - edited 08-15-2025 10:55 AM
Hi @ChristianRRL,
I guess you can more or less estimate the cost of directory listing mode. Below is an explanation of how directory listing works. You can then grab the Azure pricing calculator, check the price of a listing operation, and arrive at a rough estimate.
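As a back-of-the-envelope illustration, something like the sketch below; the file count, API page size, operation price, and trigger frequency are placeholder assumptions, so plug in the numbers from the Azure calculator for your region and storage tier:

```python
# Rough estimate of directory listing cost on ADLS (all numbers are assumptions).
total_files = 1_000_000          # files currently sitting in the monitored directory
files_per_list_page = 5_000      # assumed max results per List Blobs call
price_per_10k_list_ops = 0.05    # USD, assumed price per 10,000 list operations
triggers_per_day = 24 * 60       # e.g. one micro-batch per minute

list_calls_per_trigger = -(-total_files // files_per_list_page)  # ceiling division
daily_list_cost = (
    list_calls_per_trigger * triggers_per_day * price_per_10k_list_ops / 10_000
)
print(f"~{list_calls_per_trigger} list calls per trigger, "
      f"~${daily_list_cost:.2f}/day in listing operations")
```

The point is that the number of list calls per trigger grows with the total file count, so the daily listing cost keeps climbing even if the volume of new data stays flat.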
Tracking the cost of file notification mode should be much easier. Just apply tags to the resources created by Autoloader (e.g., the storage queue) and you can track the cost from there. This is the approach recommended by Databricks:
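If I remember the option name correctly, you can set those tags directly in the stream definition with cloudFiles.resourceTag.&lt;key&gt;; a sketch (the tag key/value and path are made up):

```python
# Sketch: tag the resources Auto Loader creates in file notification mode
# (Event Grid subscription, storage queue) so their cost shows up under the tag.
tagged_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.resourceTag.costCenter", "autoloader-ingest")  # hypothetical tag
    .load("abfss://raw@mystorageaccount.dfs.core.windows.net/events/")
)
```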