Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Autoloader with file notifications on a queue that is on a different storage account to the blobs

ChrisLawford_n1
Contributor

Hello,

I am trying to set up autoloader using file notifications, but as the storage account we are reading from is a premium storage account, we have set up event subscriptions to pump the blob events to queues that exist on a standard Gen2 storage account.

Storage account 1 (blobs):
RG: RG-001
Type: BlockBlobStorage

Storage account 2 (queues):
RG: RG-002
Type: StorageV2 (general purpose v2)

The Databricks volume is linked to storage account 1, and because of this we put the rest of the details for the queue storage account into the autoloader options, thinking that this was the only way to tell autoloader where the queues exist.

At the moment we are configuring it like below:

        df = (spark.readStream.format("cloudFiles")
              .options(**{"cloudFiles.format": "parquet", "cloudFiles.clientId": "XXXXXXX", "cloudFiles.clientSecret": "XXXXXXXXX",
                          "cloudFiles.tenantId": "XXXXXXX", "cloudFiles.subscriptionId": "XXXXXX", "cloudFiles.resourceGroup": "RG-002",
                          "cloudFiles.queueName": "cd",  # existing queue on storage account 2
                          "cloudFiles.includeExistingFiles": "false", "cloudFiles.useNotifications": "true"})
              .load(f"{volume}{data_path_structure}{table}/*"))

This is running, but no data is being processed.
My suspicion is that, however autoloader actually uses those messages, it is not recognizing that the storage account it should be getting the data from is different from the one the queue is on.

Is there something I am missing?

2 REPLIES

lingareddy_Alva
Honored Contributor III

Hi @ChrisLawford_n1 

So in your case, Autoloader is not able to resolve the file paths from the event notifications because they point to a different storage account (Storage Account 1) from the one the queue is associated with.

Use a StorageV2 Account for Both Blob and Queue
- Migrate your blob storage to a StorageV2 (Standard Gen2) account (i.e., move off Premium).
- Then configure native file notification support in the same account: Event Grid → Storage Queue.
- Autoloader will then recognize and handle this properly (a rough sketch follows below).
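
Something like this is what I mean once blobs and queue are in the same StorageV2 account (a rough sketch only; the account, path and service principal values are placeholders, and I'm assuming the service principal has the permissions Autoloader needs to create the notification resources):

        # Rough sketch: blobs and queue in the same StorageV2 account; all values are placeholders.
        df = (spark.readStream.format("cloudFiles")
              .option("cloudFiles.format", "parquet")
              .option("cloudFiles.useNotifications", "true")
              .option("cloudFiles.clientId", "<sp-client-id>")
              .option("cloudFiles.clientSecret", "<sp-client-secret>")
              .option("cloudFiles.tenantId", "<tenant-id>")
              .option("cloudFiles.subscriptionId", "<subscription-id>")
              .option("cloudFiles.resourceGroup", "<rg-of-the-storagev2-account>")
              .load("abfss://<container>@<storagev2-account>.dfs.core.windows.net/<path>/"))
        # With no cloudFiles.queueName set, Autoloader creates the Event Grid subscription and queue itself.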

Custom Event Relay
If you must keep blob data in Premium:
- Use Azure Functions or Azure Logic Apps to read events from the Storage Account 2 queue and rewrite or forward them to a queue on Storage Account 1 (or to a compatible service); a rough sketch follows after this list.
- But this is complex and fragile, and generally not recommended unless absolutely required.
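
If you did go the relay route, the shape of it would be roughly this (a sketch only, using the azure-storage-queue SDK; the queue names and connection strings are placeholders, and the destination would have to be a queue on an account that supports queues, or some other compatible service, since the Premium account can't host one):

        # Rough relay sketch: forward Event Grid messages from one storage queue to another.
        from azure.storage.queue import QueueClient

        SOURCE_CONN = "<connection string for storage account 2>"      # placeholder
        DEST_CONN = "<connection string for the destination account>"  # placeholder

        source = QueueClient.from_connection_string(SOURCE_CONN, "<source-queue>")
        dest = QueueClient.from_connection_string(DEST_CONN, "<destination-queue>")

        # Pull a batch of messages, forward the raw payload unchanged, then delete the original.
        for msg in source.receive_messages(messages_per_page=32, visibility_timeout=60):
            dest.send_message(msg.content)
            source.delete_message(msg)

In a real deployment this would more likely be a queue-triggered Azure Function than a polling loop, which is part of why it gets fragile.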

Recommendation
For Autoloader with file notifications to work reliably and correctly, you should:
- Move data ingestion to a StorageV2 (Standard Gen2) account.
- Keep blob and queue in the same storage account.
- Let Databricks handle everything without extra routing logic.

Or, alternatively:

Use Event Hubs or Service Bus instead of Storage Queues:

 - Reconfigure Event Grid: Change your Event Grid subscription endpoint from Storage Queue to Service Bus Topic/Queue or Event Hub
 - Update Autoloader config: Use Service Bus connection string instead of storage account details
 - This bypasses the cross-storage account issue since Service Bus isn't tied to a specific storage account

LR

Hey, 

I agree it would be ideal to have the data on a storage account that supports queues but unfortunately this is not in my control.
Regarding your option:

Use Event Hubs or Service Bus instead of Storage Queues:

 - Reconfigure Event Grid: Change your Event Grid subscription endpoint from Storage Queue to Service Bus Topic/Queue or Event Hub
 - Update Autoloader config: Use Service Bus connection string instead of storage account details
 - This bypasses the cross-storage account issue since Service Bus isn't tied to a specific storage account

Can you please help me understand how to configure autoloader to do this? I can't see any such options in the documentation: 
https://learn.microsoft.com/en-us/azure/databricks/ingestion/cloud-object-storage/auto-loader/option...

And regarding this option:

Custom Event Relay
If you must keep blob data in Premium:
- Use Azure Functions or Azure Logic Apps to read events from the Storage Account 2 queue and re-write or forward them to a queue on Storage Account 1 (or to a compatible service).
- But this is complex and fragile, and generally not recommended unless absolutely required.

Given that Storage account 1 is a premium storage account and doesn't support queues, we created an Azure event subscription on the storage account system topic of Storage account 1 and have it output to a queue on Storage account 2. I assume you are mainly talking about using a compatible service such as Event Hub in your other option.

Just for a bit of a rant: 
I would like to know why a queue on a separate storage account, populated with messages from another storage account, couldn't be made to work. The AWS version, where the queueUrl is completely separate from the storage, appears to work exactly as I would expect.
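
For comparison, my understanding of the AWS side is that the existing queue is passed as cloudFiles.queueUrl, a full SQS URL that is completely independent of the bucket being read (rough sketch, placeholder names only):

        # Rough AWS comparison sketch: the SQS queue URL is supplied separately from the S3 path being read.
        df = (spark.readStream.format("cloudFiles")
              .option("cloudFiles.format", "parquet")
              .option("cloudFiles.useNotifications", "true")
              .option("cloudFiles.queueUrl", "https://sqs.<region>.amazonaws.com/<account-id>/<queue-name>")
              .load("s3://<bucket>/<path>/"))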
The messages on the queue, being the system topic events, contain all the information needed to locate the correct data:
{
  "topic": "/subscriptions/{SUB_ID}/resourceGroups/{RG_001}/providers/Microsoft.Storage/storageAccounts/{SA-001}",
  "subject": "/blobServices/default/containers/rawzone/blobs/{PATH_TO_PARQUET}",
  "eventType": "Microsoft.Storage.BlobCreated",
  "id": "{ID}",
  "data": {
    "api": "CreateFile",
    "clientRequestId": "{CLIENT_REQUEST_ID}",
    "requestId": "{REQUEST_ID}",
    "eTag": "{ETAG}",
    "contentType": "application/octet-stream",
    "contentLength": 0,
    "contentOffset": 0,
    "blobType": "BlockBlob",
    "blobProperties": [{"acl": [{"access": "u::rw,g::r,o::", "permission": "0640", "owner": "{OWNER_ID}", "group": "$superuser"}]}],
    "blobUrl": "https://{SA-001}.blob.core.windows.net/rawzone/{PATH_TO_PARQUET}",
    "url": "https://{SA-001}.dfs.core.windows.net/rawzone/{PATH_TO_PARQUET}",
    "sequencer": "00000000000000000000000000031001000000000000dd40",
    "identity": "{ID}",
    "storageDiagnostics": {"batchId": "{BATCH_ID}"}
  },
  "dataVersion": "3",
  "metadataVersion": "1",
  "eventTime": "2025-07-29T12:47:45.7969224Z"
}

The event contains the storage account information in the topic and blobUrl fields.
Interestingly, they must be using either the subject field or the blob URLs to get the relative path to the parquet, as no other fields contain it. So it seems like they have the name of the storage account where the data is actually located but are not using it. It would be nice to know how autoloader works under the hood.
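
To illustrate, a few lines of Python are enough to pull both the source storage account and the blob path out of one of these events (a sketch only; message_body here stands in for the raw text of a queue message like the one above):

        import json
        from urllib.parse import urlparse

        # Sketch: extract the source storage account and blob path from a BlobCreated event.
        event = json.loads(message_body)
        account = event["topic"].split("/storageAccounts/")[-1]   # e.g. SA-001
        blob_url = urlparse(event["data"]["blobUrl"])
        container, _, blob_path = blob_url.path.lstrip("/").partition("/")
        print(account, container, blob_path)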
My expectation would be that the autoloader configuration would be happy with the load path pointing at the location you want to read data from, while the additional notification configuration reads from a queue that could live anywhere.
