Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Auto Loader Managed File Events

Sainath368
Contributor

Hi all,

We are in the process of migrating from directory listing to managed file events in Azure Databricks. Our data is stored in an Azure Data Lake container with the following folder structure:

[Screenshot: folder structure of the Azure Data Lake container, with tables organized under /Landing/<schema>/<table>]

To enable file events in Unity Catalog (UC), I created an external location pointing to the /Landing folder and enabled file events on it.

Now, we are using Structured Streaming with Auto Loader to process files, and I have a question regarding how to specify the load path in the readStream function:

  • Can we specify the load path directly down to the /specificTable folder (e.g., /Landing/specificSchema/specificTable) when using Auto Loader in Structured Streaming? Will managed file events still work here, given that file events are enabled on the external location only up to /Landing?

  • Or do we need to keep the load path the same as the external location path (i.e., pointing to /Landing) and handle the table-specific logic separately in the streaming job?

    Previously, with directory listing, we used multi-threading to run a separate stream for each table, dynamically passing a different path to each stream. With this new file-events setup, we want to know whether we can do the same for each table by specifying the path down to the table level and setting cloudFiles.useManagedFileEvents to 'true' for each stream (see the sketch below), or whether we are required to keep the load path at the external location level.
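For concreteness, here is a minimal sketch of the per-table pattern we have in mind. The table list, source format, storage paths, and checkpoint locations are all illustrative placeholders, and spark is the session provided by the Databricks notebook:

from concurrent.futures import ThreadPoolExecutor

# Illustrative table list -- real schema and table names differ
tables = [("schema1", "tableA"), ("schema1", "tableB"), ("schema2", "tableC")]

def start_stream(schema, table):
    # One Auto Loader stream per table, pointing below the external-location
    # root (/Landing) and using managed file events
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "parquet")  # assumed source format
        .option("cloudFiles.useManagedFileEvents", "true")
        .load(f"/Landing/{schema}/{table}")
        .writeStream
        .option("checkpointLocation", f"/checkpoints/{schema}/{table}")  # one checkpoint per stream
        .toTable(f"{schema}.{table}")
    )

# Start the streams concurrently, mirroring our old multi-threaded setup
with ThreadPoolExecutor(max_workers=8) as pool:
    queries = list(pool.map(lambda t: start_stream(*t), tables))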





4 REPLIES

K_Anudeep
Databricks Employee

Hello @Sainath368,

Below are the answers to your questions:

  • Can we specify the load path directly down to the /specificTable folder (e.g., /Landing/specificSchema/specificTable) when using Auto Loader in Structured Streaming? Will managed file events still work here, given that file events are enabled on the external location only up to /Landing?

You can point Auto Loader directly at /Landing/specificSchema/specificTable and still use managed file events, as long as that path is inside the external location that has file events enabled. You do not need the load path to be exactly /Landing.

 

Managed file events are configured on the external location root (/Landing), but they apply to all subfolders under it. So you can still use Auto Loader with:

.option("cloudFiles.useManagedFileEvents", "true") .load("/Landing/specificSchema/specificTable")

NOTE: This works for each table/stream as long as the path stays inside that external location and each stream has its own checkpoint location.
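To illustrate, the write side of one such stream might look like this (the checkpoint path and target table name below are placeholders):

# Each stream writes with its own checkpoint location (placeholder paths)
(
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.useManagedFileEvents", "true")
    .load("/Landing/specificSchema/specificTable")
    .writeStream
    .option("checkpointLocation", "/checkpoints/specificSchema/specificTable")  # unique per stream
    .toTable("specificSchema.specificTable")  # placeholder target table
)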

 

Anudeep

Sainath368
Contributor

Hi @K_Anudeep,

Thank you for the detailed explanation! That clears up the path configuration for using Auto Loader with managed file events.

I have a follow-up question:

If we proceed with one stream per table, will there be a single queue for all streams, or will each table/stream have its own independent queue for file event processing?

This is important for understanding how we can manage concurrent streams, especially when we're using multi-threading to process data for different tables independently.

Looking forward to your insights!

K_Anudeep
Databricks Employee
(Accepted Solution)

Hello @Sainath368,

For managed file events, there would be a single queue per external location, not one per table/stream.
All your table-level Auto Loader streams that read from paths under /Landing will share the same file-event queue, and each stream filters the events based on its own load path (prefix).

Also, as a best practice, it's recommended to create volumes for every Auto Loader use case on the external location and to use volume URLs in the workload (e.g., /Volumes/catalog/schema/volumename).
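For instance (the catalog, schema, and volume names below are placeholders), the same stream would read through the volume path:

# Same Auto Loader stream, reading through a UC volume path
# (catalog, schema, and volume names are placeholders)
(
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.useManagedFileEvents", "true")
    .load("/Volumes/mycatalog/myschema/landing/specificSchema/specificTable")
)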

Please let me know if this answers your question, and please accept this as a solution if it was helpful.

Anudeep

Raman_Unifeye
Contributor III
(Accepted Solution)

Recommended approach to continue your existing pattern:

  1. Keep the External Location enabled for file events at the high-level path (/Landing).

  2. Run a separate Structured Streaming job for each table, specifying the full sub-path in the .load() function (e.g., /Landing/specificSchema/specificTable).

  3. Set the option cloudFiles.useManagedFileEvents='true' on every stream.

Find more details at Configure Auto Loader streams in file notification mode - Azure Databricks
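Putting the three steps together, one per-table stream would look roughly like this; the source format, checkpoint path, and target table name are illustrative assumptions:

# Sketch of one per-table stream following steps 1-3 above
# (source format, checkpoint path, and target table are illustrative)
query = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")                 # assumed source format
    .option("cloudFiles.useManagedFileEvents", "true")   # step 3
    .load("/Landing/specificSchema/specificTable")       # step 2: full table sub-path
    .writeStream
    .option("checkpointLocation", "/checkpoints/specificSchema/specificTable")
    .toTable("specificSchema.specificTable")
)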