Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
MuraliTalluri
Databricks Employee

autoloader-file-events-banner.png

Summary

  • Learn how Auto Loader with file events simplifies cloud storage ingestion by removing the trade-off between the simplicity of directory listing and the performance of classic notifications.
  • Discover the architectural differences and benefits of file events compared to classic notifications: simpler permissions, relief from cloud provider limits, and automatic resource management.
  • Learn how to enable file events, when to migrate, and best practices for optimal file discovery.

 

Databricks customers rely on Auto Loader to access cloud storage data at scale. Now, file events are taking Auto Loader to the next level—delivering the speed and efficiency of incremental, notification-based processing alongside simplified setup, automated lifecycle management, and seamless scalability. The result is faster ingestion, less operational overhead, and infrastructure that scales as your data grows.

 

Challenges with cloud storage ingestion

Historically, Databricks customers have chosen between two modes of Auto Loader. Directory listing is simple to set up but can be expensive and slow at high file volumes, since it repeatedly scans the cloud storage directory to identify new files. Classic notifications are more efficient, but they require permissions to set up notification resources (like SQS or Event Grid) in addition to those granted via Unity Catalog.

Cloud providers also limit the number of event notifications per storage container. Since each Auto Loader job with classic notifications requires its own queue/subscription, you are capped at 100 jobs on AWS/GCP and 500 on Azure. Finally, customers must manually manage the lifecycle of these resources as streams are created or deleted.

To orchestrate these pipelines, many customers use file arrival triggers, which automatically start the pipeline’s job when new files arrive in cloud storage. However, file arrival triggers have historically relied on directory listing, so they could only support cloud storage directories with fewer than 10,000 objects. 

 

Introducing Auto Loader with File Events

File events eliminate these hurdles by managing cloud notification infrastructure for you—providing the simplicity of directory listing with the performance of classic notifications. File events also enable scalable file arrival triggers, allowing you to automatically run jobs based on cloud storage directories of any size.

When file events are enabled on an external location, a Databricks-managed service creates notification resources, then reads and caches notifications. Auto Loader incrementally discovers new files from this cache using an efficient API that resumes from the last processed file. 

Because this service uses permissions granted via Unity Catalog, users don’t have to configure notification infrastructure, tune complex parameters, or manage queue and subscription lifecycles. And by using a single queue for the entire external location, rather than for each Auto Loader stream, it’s easier to avoid per-bucket notification limits.

High-Level Architectural Design Comparison

 

blog-1.png

High-level Auto Loader architecture with classic notifications.

 

With classic notifications, each Auto Loader stream or job requires its own separate subscription and queue, even if all of these streams point to the same storage container or bucket.

blog-2.png

High-level Auto Loader architecture with file events.

 

File Events Architecture Deep-Dive 

When you enable cloudFiles.useManagedFileEvents, the system automates the heavy lifting of ingestion.

  1. Set up: When you enable file events on an external location, Databricks creates the necessary cloud subscriptions and queues.
    1. All new external locations have file events enabled by default. When enabling file events on an external location, the storage credential backing that location must have permissions to create cloud subscriptions and queues; if the required permissions are not granted, the user receives an error. Refer here for the required permissions.
    2. For simplicity, Databricks recommends using “managed” queues (where the user grants the storage credential permissions to set everything up). However, you can also bring your own queue, in which case the storage credential needs permission to read from it.
  2. Publish files: Files that need to be processed are landed in your cloud storage location.
  3. Publish notifications: Enabling file events automatically creates the subscription service (SNS, Event Grid, or Pub/Sub) to which storage notifications are published.
  4. Publish to queue: Events from the subscription service are published to the queue.
  5. Get file events: The Databricks file events service reads messages from the queue.
  6. Store file metadata: The messages read from the queue are cached.
  7. List objects: When an Auto Loader stream runs with cloudFiles.useManagedFileEvents set to true, it incrementally reads events from the file events cache and processes them.
    1. If the Auto Loader stream is running for the first time, it performs a directory listing on the load path to cache the list of existing files. It runs incrementally thereafter.
    2. To ensure no files are missed, it also periodically performs a directory listing on the external location. (Since this is managed automatically, the Auto Loader setting cloudFiles.backfillInterval is no longer needed and is ignored.)
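The discovery flow above can be sketched as a simple in-memory model. This is purely illustrative and not the Databricks implementation: the real cache, sequence tracking, and resume logic are internal to the managed file events service.

```python
# Illustrative sketch only: models how a notification cache plus a per-stream
# checkpoint enables incremental file discovery that resumes from the last
# processed event, instead of re-listing the directory on every batch.

class FileEventsCache:
    """Caches notifications read from the cloud queue, in arrival order."""
    def __init__(self):
        self._events = []  # list of (sequence_id, file_path)

    def publish(self, file_path):
        self._events.append((len(self._events), file_path))

    def read_since(self, last_seq):
        """Return only events newer than the caller's checkpoint."""
        return [(seq, path) for seq, path in self._events if seq > last_seq]


class AutoLoaderStream:
    """Models one stream: initial directory listing, then incremental reads."""
    def __init__(self, cache, existing_files):
        self.cache = cache
        self.last_seq = -1
        self.processed = list(existing_files)  # first run: directory listing

    def run_batch(self):
        new = self.cache.read_since(self.last_seq)
        for seq, path in new:
            self.processed.append(path)
            self.last_seq = seq
        return [path for _, path in new]


cache = FileEventsCache()
stream = AutoLoaderStream(cache, existing_files=["a.json"])
cache.publish("b.json")
cache.publish("c.json")
first = stream.run_batch()   # discovers b.json and c.json from the cache
second = stream.run_batch()  # nothing new; no directory re-listing needed
```

Note how the second batch costs nothing: the stream only asks the cache for events past its checkpoint, which is the property that makes file events cheaper than repeated directory listing.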

Comparison: classic notifications vs. file events

Cloud Infrastructure
  • Classic notifications: Each Auto Loader stream sets up its own dedicated cloud notification resource (e.g., an SNS topic and SQS queue per stream).
  • File events: Databricks sets up a single notification resource (queue/subscription) per external location.

Notification Service
  • Classic notifications: Auto Loader streams read directly from their dedicated cloud queue.
  • File events: Auto Loader streams read from a Databricks file events service cache, which is populated by the single cloud queue.

Cloud Resource Limits
  • Classic notifications: Limited by cloud provider caps on the number of notification resources per storage container (e.g., 100/500 per container/account). One stream maps to one queue.
  • File events: Avoids cloud limits by using a single queue per external location. Many streams can map to one queue.

Permissions
  • Classic notifications: Requires elevated permissions for each stream to create and manage its dedicated cloud notification resources.
  • File events: Requires elevated permissions granted via the Unity Catalog storage credential for the external location. No extra permissions needed per stream.

Lifecycle Management
  • Classic notifications: Users often have to manually manage notification resources (queues/subscriptions) when streams are deleted or fully refreshed.
  • File events: Databricks automatically manages the lifecycle of the cloud notification resources.

 

In Practice: How Auto Loader with File Events Supports Customers

Performance & Simplicity

  • Performance Meets Simplicity: You no longer need to decide between the efficiency of classic notifications and the simplicity of directory listing. Auto Loader with File Events offers both.
  • Eliminate Costly/Slow Directory Listing: If you’re using Auto Loader in directory listing mode today, you should see significant performance improvements by migrating to file events, because the Auto Loader stream no longer needs to regularly perform directory listing on the load path to discover new files.
  • Zero Tuning Required for File Discovery: You no longer need to tune file discovery settings (e.g., classic notification options such as fetchParallelism and backfillInterval).

Permissions

  • No Extra Permissions Per Stream: The service principal or user who runs the Auto Loader stream for the first time doesn’t need permissions to create subscriptions or queues; they only need read permissions on the input data path.

Operational Overhead

  • One Queue Per External Location, Not Per Stream: Unlike classic notifications, you no longer need to create a queue for each Auto Loader stream; this helps you avoid hitting notification limits on storage containers. We set up just one queue and storage event subscription per external location.
  • Automatic Cloud Resource Lifecycle Management: You no longer need to manage the lifecycle of event subscriptions and queues created in the cloud (e.g., when an Auto Loader stream is deleted or fully refreshed). These resources are deleted automatically (provided the storage credential has the required permissions) once file events are disabled on the external location.

Refer here for how to migrate from classic notifications to file events. 

 

Setting up File Events: Quick guide

  • All new external locations have file events enabled by default. For existing external locations, you can go to the edit page or create a new external location and select “Advanced Options.” 

screenshot-1.png

 

 

  • Select the File Event Type. If you’d like Databricks to set up subscriptions and events for you, choose “Automatic” (recommended). If you’ve configured a queue yourself, use “Provided.” Click Create.
  • Lastly, after a few seconds, click “Test connection” on the external location page to confirm that file events have been enabled successfully.

screenshot-2.png
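If you prefer automation over the UI, the same toggle exists on the external location object in the Unity Catalog REST API (PATCH /api/2.1/unity-catalog/external-locations/{name}). The field name below is an assumption based on recent API versions; verify it against the current API reference before relying on it:

```json
{
  "enable_file_events": true
}
```

Choosing “Provided” in the UI corresponds to additionally supplying your own queue details in the request body; with “Automatic”, Databricks provisions the queue for you.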

Examples

Here is an example of an Auto Loader stream using file events:

query = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.maxFilesPerTrigger", 5000)
    .option("cloudFiles.schemaLocation", f"{checkpoint_path}/schema")
    .option("cloudFiles.useManagedFileEvents", True)
    .load(volume_path)
    .writeStream
    .queryName("file-events-stream")
    .option("checkpointLocation", checkpoint_path)
    .trigger(processingTime="30 seconds")
    .table(target_table)
)

If you’re using Spark Declarative Pipelines today and already have a pipeline with a STREAMING TABLE, update it to include the cloudFiles.useManagedFileEvents option.

CREATE OR REFRESH STREAMING TABLE <table-name>
AS SELECT <select clause expressions>
FROM cloud_files(
  "abfss://path/to/external/location/or/volume",
  "<format>",
  map(
    ...
    "cloudFiles.useManagedFileEvents", "true",
    ...
  )
)

You can also set this as a Spark configuration:

spark.conf.set("spark.databricks.cloudFiles.useManagedFileEvents", "true")

Best practices

Create volumes for multiple subdirectories in the same external location

  • If you have multiple subdirectories under an external location, each consumed by its own Auto Loader stream, create and use a volume for each subdirectory for faster file discovery with file events. Refer: Use volumes for optimal file discovery
  • Note: If a single Auto Loader stream consumes from the entire external location, this does not apply.
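One intuition for why narrower scoping helps: all events for an external location flow through a single shared feed, so each stream must filter that feed down to its own load path. The sketch below is purely illustrative (this filtering mechanism is an assumption for explanatory purposes, not a description of Databricks internals):

```python
# Illustrative only: a stream scoped to a narrow path (e.g., via a volume)
# has fewer events in the shared feed that match its prefix, so discovery
# touches less data per batch.
events = [
    "sales/2024/01.json", "sales/2024/02.json",
    "logs/app/a.log", "logs/app/b.log", "logs/sys/c.log",
]

def events_for(load_path, all_events):
    """A stream only processes events under its own load path prefix."""
    prefix = load_path.rstrip("/") + "/"
    return [e for e in all_events if e.startswith(prefix)]

sales_stream = events_for("sales", events)    # only the sales/ events
logs_stream = events_for("logs/app", events)  # only the logs/app/ events
```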

Use file arrival triggers to optimize your ingestion jobs

  • You can run your Auto Loader job continuously, on a schedule, or with file arrival triggers.
  • File arrival triggers are often the best choice because they run the job only when new files appear in cloud storage—enabling truly event-driven orchestration to complement your incremental ingestion.
  • File events make file arrival triggers significantly more performant: 
 

File tracking limit
  • File events not enabled: up to 10,000 files in the directory
  • File events enabled: unlimited

File arrival triggers per workspace
  • File events not enabled: maximum of 50
  • File events enabled: unlimited
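As a sketch, a file arrival trigger is configured in a job’s trigger settings via the Jobs API. The field names below follow the public Jobs API, but the storage URL is a placeholder and you should verify the exact schema against the current Jobs API reference:

```json
{
  "trigger": {
    "pause_status": "UNPAUSED",
    "file_arrival": {
      "url": "abfss://container@storageaccount.dfs.core.windows.net/ingest/",
      "min_time_between_triggers_seconds": 60
    }
  }
}
```

With this in place, the job starts only when new files land under the monitored path, rather than polling on a fixed schedule.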

 

Conclusion 

  • If you are using Auto Loader with directory listing today, we recommend migrating to file events. 
  • If you are using Auto Loader with classic notifications today with < 2000 files ingested per second, we recommend migrating to file events. 
  • File events are Generally Available on AWS, Azure and GCP.