Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

SQS messages disappear immediately when File Events enabled on External Location with pre-provisioned SQS

truongtran
New Contributor II

Environment

  • Cloud: AWS
  • Unity Catalog: Enabled
  • Auto Loader Mode: File Notification (Legacy) with pre-provisioned SQS

Problem

We are using Auto Loader in legacy file notification mode with a pre-provisioned SQS queue (cloudFiles.useNotifications = true + cloudFiles.queueUrl). The architecture is:

S3 (s3:ObjectCreated:*) → SNS Topic → SQS Queue → Auto Loader

The S3 bucket publishes s3:ObjectCreated:* events to an SNS topic, which fans out to our SQS queue. Auto Loader consumes from this SQS queue.
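For illustration, the fan-out described above can be sketched in plain Python, with no AWS dependencies (the topic/queue objects and the file key are hypothetical stand-ins, not real AWS resources):

```python
from collections import deque

class SnsTopic:
    """Minimal stand-in for an SNS topic that fans out to subscribed queues."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, queue):
        self.subscribers.append(queue)

    def publish(self, message):
        # Each subscribed queue receives its own copy of the message.
        for queue in self.subscribers:
            queue.append(message)

# S3 -> SNS -> SQS: an ObjectCreated event fans out to the subscribed queue.
topic = SnsTopic()
autoloader_queue = deque()  # stand-in for the pre-provisioned SQS queue
topic.subscribe(autoloader_queue)

topic.publish({"eventName": "s3:ObjectCreated:Put", "key": "data/file-001.json"})
print(len(autoloader_queue))  # 1 message waiting for Auto Loader to poll
```

The point of the fan-out is that the message sits in the SQS queue until some consumer reads and deletes it, which is exactly the assumption the daily Auto Loader job relies on.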

When we enable File Events on the External Location (Unity Catalog) pointing to the same S3 path, SQS messages start disappearing within seconds of arrival — even though our Auto Loader job is not running at that moment.

SQS Configuration

  • Queue type: Standard
  • Visibility timeout: 1 hour
  • Message retention period: 4 days (default)
  • No other consumers configured on this queue — only Databricks

Observed Behavior

  • File Events disabled on External Location: messages arrive in SQS and remain visible, as expected.
  • File Events enabled on External Location (same pre-provisioned SQS): messages arrive in SQS but disappear within a few seconds, even though no consumer is running (our Auto Loader job is not active at that time).
  • File Events disabled again + SQS config removed: behavior returns to normal; messages arrive and stay in the queue.

Impact

This led to data loss in our pipeline. New files pushed to S3 generated SQS messages, but those messages were consumed and deleted before our Auto Loader job ran. When the job eventually triggered (daily schedule), the SQS messages were already gone — so Auto Loader saw no new events and did not ingest the new files into the target table.

What We Understand

From the documentation (Auto Loader with file events overview):

  • The Databricks File Events service listens to file events and caches file metadata.
  • Databricks uses the permissions from the storage credential to read and delete messages from the queue.
  • There is only one queue and storage event subscription per external location.

Questions

  1. Why do SQS messages disappear within seconds even though no Databricks job is running? When File Events is enabled on the External Location, something is consuming and deleting messages from our pre-provisioned SQS queue — but our Auto Loader job is scheduled daily and was not active at that time. What background process is consuming these messages?
  2. Why doesn't the next Auto Loader job run process those events? If the Databricks File Events service consumed the SQS messages to build its internal cache, shouldn't Auto Loader with cloudFiles.useNotifications = true + cloudFiles.queueUrl still be able to discover those files on the next run — either from the cache or from the queue?

Thank you


Sidhant07
Databricks Employee

Hi,

This seems to be an issue, but could you check the cloud_files_state metrics (in particular, when each file was created, discovered, and processed) and confirm whether the Auto Loader job was running at that time?

May I know if you have tried managed file events and whether you see the same issue there?

SteveOstrowski
Databricks Employee

Hi @truongtran,

Thank you for the thorough write-up with the environment details and reproducible scenarios -- that makes it much easier to pinpoint what is happening.

WHAT IS HAPPENING

When you enable File Events on an External Location in Unity Catalog, the Databricks File Events service starts actively consuming and deleting messages from the SQS queue associated with that location. As the documentation states, Databricks uses the permissions from the storage credential to "read and delete messages from the queue."

This is the background process you are seeing: the File Events service itself is the consumer that is draining your pre-provisioned SQS queue, even when your Auto Loader job is not running. The File Events service runs continuously as a managed Databricks service -- it does not depend on your Auto Loader stream being active.

WHY YOUR AUTO LOADER JOB MISSES THE FILES

Here is the sequence of events that leads to the data loss you observed:

1. A new file lands in S3, triggering an S3 event notification to your SNS topic.
2. The SNS topic delivers the message to your pre-provisioned SQS queue.
3. The Databricks File Events service (enabled on the External Location) reads and deletes the message from that SQS queue, caching the file metadata internally.
4. When your daily Auto Loader job runs using legacy file notification mode (cloudFiles.useNotifications = true + cloudFiles.queueUrl), it polls the SQS queue -- but the messages are already gone.
5. Because your job uses legacy mode (not managed file events), it does not read from the File Events cache. It only knows about the SQS queue, which is now empty.
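The five steps above can be sketched as a small pure-Python simulation (the queue, cache, and file key are illustrative stand-ins, not the real services):

```python
from collections import deque

sqs_queue = deque()       # stand-in for the pre-provisioned SQS queue
file_events_cache = []    # stand-in for the File Events service's internal cache

def s3_object_created(key):
    # Steps 1-2: S3 event notification delivered via SNS into the SQS queue.
    sqs_queue.append({"eventName": "s3:ObjectCreated:Put", "key": key})

def file_events_service_poll():
    # Step 3: the always-on File Events service reads AND deletes messages,
    # caching the file metadata internally.
    while sqs_queue:
        msg = sqs_queue.popleft()
        file_events_cache.append(msg["key"])

def legacy_autoloader_run():
    # Steps 4-5: the daily legacy-mode job polls only the SQS queue; it has
    # no access to the File Events cache.
    discovered = [msg["key"] for msg in sqs_queue]
    sqs_queue.clear()
    return discovered

s3_object_created("data/2024-06-01/part-000.json")
file_events_service_poll()          # runs continuously, wins the race
ingested = legacy_autoloader_run()  # runs once daily, queue already empty
print(ingested)                     # [] -> the new file is never ingested
print(file_events_cache)            # ['data/2024-06-01/part-000.json']
```

The metadata is not lost, it just lands in a cache the legacy-mode job never consults, which is why the pipeline silently skips the files.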

The fundamental problem is that legacy file notification mode and managed File Events are two separate file discovery mechanisms, and they conflict when pointed at the same SQS queue. The File Events service consumes the messages that your legacy Auto Loader job expects to find.

THE ROOT CAUSE: TWO COMPETING CONSUMERS

The Databricks documentation on Auto Loader options explicitly states that these options are mutually exclusive:

- cloudFiles.useNotifications / cloudFiles.queueUrl (legacy file notification mode)
- cloudFiles.useManagedFileEvents (managed file events mode)

When you enable File Events on the External Location, you are activating the managed file events infrastructure. But your Auto Loader job is still configured for legacy mode. The result is two competing consumers on the same queue: the File Events service wins the race because it runs continuously, while your job only runs once daily.

HOW TO RESOLVE THIS

You have two options:

Option 1: Switch to Managed File Events (Recommended)

This is the recommended path going forward. Reconfigure your Auto Loader job to use managed file events instead of legacy file notification mode:

spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "your_format") \
    .option("cloudFiles.useManagedFileEvents", "true") \
    .load("s3://your-bucket/your-path/")

Remove the cloudFiles.useNotifications and cloudFiles.queueUrl options entirely. With this configuration, Auto Loader reads from the File Events cache instead of polling SQS directly. This approach:
- Requires Databricks Runtime 14.3 LTS or higher
- Requires File Events to be enabled on the External Location (which you already have)
- Does not require you to manage your own SQS queue
- Supports all Auto Loader streams on the same bucket with a single queue

Important: the file events cache holds metadata for files modified in the last 7 days, so you should run Auto Loader at least once every 7 days. For your daily schedule, this is not an issue.
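As a sanity check of the retention arithmetic, here is a tiny sketch (the 7-day window comes from the passage above; the dates are made up):

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)  # file events cache window per the docs

def within_cache_window(modified_at, now):
    """True if a file modified at `modified_at` is still in the cache window."""
    return now - modified_at <= RETENTION

now = datetime(2024, 6, 10, tzinfo=timezone.utc)
# 5 days old: still in the cache, a daily schedule is comfortably inside the window.
assert within_cache_window(datetime(2024, 6, 5, tzinfo=timezone.utc), now)
# 9 days old: evicted, so a stream paused longer than a week could miss it.
assert not within_cache_window(datetime(2024, 6, 1, tzinfo=timezone.utc), now)
```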

Note that on the very first run after switching, Auto Loader will perform a full directory listing to get current with the file events cache, so it should pick up any files that were missed.

Option 2: Keep Legacy Mode and Disable File Events on the External Location

If you need to stay on legacy file notification mode for now, disable File Events on the External Location to stop the managed service from consuming your SQS messages. This is what you observed working in your third test scenario.

With this option, your existing architecture (S3 -> SNS -> SQS -> Auto Loader) continues to work as before.

RECOVERING MISSED DATA

If you have already lost SQS messages and need to catch up on missed files, you can do a one-time backfill. One approach:

spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "your_format") \
    .option("cloudFiles.useManagedFileEvents", "true") \
    .option("cloudFiles.includeExistingFiles", "true") \
    .load("s3://your-bucket/your-path/")

Setting includeExistingFiles to true on the first run triggers a full directory listing, which will discover all existing files regardless of whether SQS messages were consumed.

REFERENCES

- Auto Loader with file events overview: https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-events-explained
- Configure Auto Loader in file notification mode: https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/file-notification-mode
- Auto Loader configuration options: https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/options
- Manage external locations (File Events): https://docs.databricks.com/aws/en/connect/unity-catalog/manage-external-locations

I hope this helps clarify the behavior you observed. The key takeaway is that enabling File Events on an External Location introduces a background consumer for SQS messages, which conflicts with legacy file notification mode. Switching to managed file events (Option 1) is the cleanest resolution and aligns with the recommended approach going forward.

* This reply was drafted with an agent system I built, which researches and drafts responses based on the wide set of documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.

Really appreciate your response