Summary
Databricks customers rely on Auto Loader to access cloud storage data at scale. Now, file events are taking Auto Loader to the next level—delivering the speed and efficiency of incremental, notification-based processing alongside simplified setup, automated lifecycle management, and seamless scalability. The result is faster ingestion, less operational overhead, and infrastructure that scales as your data grows.
Historically, Databricks customers have chosen between two modes of Auto Loader. Directory listing is simple to set up but can be expensive and slow for high file volumes, since it requires repeatedly scanning the cloud storage directory to identify any new files. Classic notifications are more efficient, but they require permissions for setting up notification resources (like SQS or Event Grid), in addition to those granted via Unity Catalog.
Cloud providers also limit the number of event notifications per single storage container. Since each Auto Loader job with classic notifications requires its own queue/subscription, you are capped at 100 jobs on AWS/GCP and 500 on Azure. Finally, customers must manually manage the lifecycle of these resources as streams are created or deleted.
To orchestrate these pipelines, many customers use file arrival triggers, which automatically start the pipeline’s job when new files arrive in cloud storage. However, file arrival triggers have historically relied on directory listing, so they could only support cloud storage directories with fewer than 10,000 objects.
File events eliminate these hurdles by managing cloud notification infrastructure for you—providing the simplicity of directory listing with the performance of classic notifications. File events also enable scalable file arrival triggers, allowing you to automatically run jobs based on cloud storage directories of any size.
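As an illustration, a file arrival trigger is attached to a job through its trigger settings. The payload below is a minimal sketch of a Jobs API job definition with a file arrival trigger; the storage URL, job name, and notebook path are hypothetical placeholders, and field names should be checked against the Jobs API reference for your workspace:

```python
import json

# Hypothetical sketch: a Jobs API payload that triggers a job whenever new
# files land under a monitored storage path. With file events enabled on the
# external location, the watched directory can contain any number of objects.
job_payload = {
    "name": "ingest-on-file-arrival",
    "trigger": {
        "pause_status": "UNPAUSED",
        "file_arrival": {
            # Directory to watch (placeholder path).
            "url": "s3://my-bucket/landing/",
            # Optional: minimum seconds between triggered runs, to batch
            # bursts of file arrivals into a single run.
            "min_time_between_triggers_seconds": 60,
        },
    },
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Workspace/ingest"},
        }
    ],
}

print(json.dumps(job_payload, indent=2))
```

In practice you would send this payload to the Jobs API (or configure the same trigger in the Jobs UI); the sketch only shows the shape of the trigger block.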
When file events are enabled on an external location, a Databricks-managed service creates notification resources, then reads and caches notifications. Auto Loader incrementally discovers new files from this cache using an efficient API that resumes from the last processed file.
Because this service uses permissions granted via Unity Catalog, users don’t have to configure notification infrastructure, tune complex parameters, or manage queue and subscription lifecycles. And by using a single queue for the entire external location, rather than for each Auto Loader stream, it’s easier to avoid per-bucket notification limits.
High-level Auto Loader architecture with classic notifications.
With classic notifications, each Auto Loader stream or job requires its own subscription and queue, even when all of those streams point to the same storage container or bucket.
High-level Auto Loader architecture with file events.
When you enable the cloudFiles.useManagedFileEvents option, Auto Loader automates the heavy lifting of ingestion. Here is how the two modes compare:
| Feature | Classic Notifications | File Events |
| --- | --- | --- |
| Cloud infrastructure | Each Auto Loader stream sets up its own dedicated cloud notification resources (e.g., an SNS topic and SQS queue per stream). | Databricks sets up a single notification resource (queue/subscription) per external location. |
| Notification service | Auto Loader streams read directly from their dedicated cloud queue. | Auto Loader streams read from a Databricks file events service cache, which is populated by the single cloud queue. |
| Cloud resource limits | Limited by cloud provider caps on the number of notification resources per storage container (e.g., 100 on AWS/GCP, 500 on Azure). One stream maps to one queue. | Avoids cloud limits by using a single queue per external location. Many streams can map to one queue. |
| Permissions | Requires elevated permissions for each stream to create and manage its dedicated cloud notification resources. | Requires elevated permissions granted once via the Unity Catalog storage credential for the external location. No extra permissions are needed per stream. |
| Lifecycle management | Users often have to manually manage notification resources (queues/subscriptions) when streams are deleted or fully refreshed. | Databricks automatically manages the lifecycle of the cloud notification resources. |
In short, file events improve on classic notifications across three dimensions: performance and simplicity, permissions, and operational overhead.
Refer to the Databricks documentation for how to migrate from classic notifications to file events.
Here is an example of an Auto Loader stream using file events:
stream_df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.maxFilesPerTrigger", 5000)
    .option("cloudFiles.schemaLocation", f"{checkpoint_path}/schema")
    .option("cloudFiles.useManagedFileEvents", True)  # enable file events
    .load(volume_path)
    .writeStream
    .queryName("file-events-stream")
    .option("checkpointLocation", checkpoint_path)
    .trigger(processingTime="30 seconds")
    .table(target_table)
)
If you’re using Spark Declarative Pipelines today and already have a pipeline with a STREAMING TABLE, update it to include the cloudFiles.useManagedFileEvents option.
CREATE OR REFRESH STREAMING TABLE <table-name>
AS SELECT <select clause expressions>
FROM cloud_files("abfss://path/to/external/location/or/volume",
"<format>",
map(
...
       "cloudFiles.useManagedFileEvents", "true",
...))
Alternatively, you can set this as a Spark configuration:
spark.conf.set("spark.databricks.cloudFiles.useManagedFileEvents", "true")
Create volumes for multiple subdirectories in the same external location
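One way to do this is to define one external volume per subdirectory, so each Auto Loader stream targets its own volume while all of them share the external location's single file events queue. The sketch below generates the CREATE EXTERNAL VOLUME statements; the catalog, schema, and storage paths are hypothetical placeholders:

```python
# Sketch (assumed names): create one external volume per subdirectory of a
# single external location. Catalog "main", schema "ingest", and the base
# path are placeholders for your own environment.
BASE = "abfss://container@account.dfs.core.windows.net/landing"

def create_volume_sql(name: str, subdir: str) -> str:
    """Return a CREATE EXTERNAL VOLUME statement for one subdirectory."""
    return (
        f"CREATE EXTERNAL VOLUME IF NOT EXISTS main.ingest.{name} "
        f"LOCATION '{BASE}/{subdir}'"
    )

for name, subdir in [("orders", "orders"), ("events", "events")]:
    stmt = create_volume_sql(name, subdir)
    # On Databricks you would execute: spark.sql(stmt)
    print(stmt)
```

Each resulting volume path can then be passed to a separate Auto Loader stream's load() call, and all streams are served by the one queue on the parent external location.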
| | File Events Not Enabled | File Events Enabled |
| --- | --- | --- |
| File tracking limit | Up to 10,000 files per directory | Unlimited |
| File arrival triggers per workspace | Maximum of 50 | Unlimited |