Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Migrating from directory-listing to Autoloader Managed File events

Sainath368
Contributor

Hi everyone,

We are currently migrating from a directory listing-based streaming approach to managed file events in Databricks Auto Loader for processing our data in structured streaming.

We have a function that handles structured streaming where we are reading data from specific folder paths (per table) and writing it to a Delta staging table. Previously, we were polling directories for new files. Now, we're switching to using managed file events for event-driven processing.

Our use case involves processing multiple tables concurrently. We're leveraging multithreading to execute the streaming function for each table in parallel, with each table processed in a separate thread.

Here's how we're setting it up:

  1. Multiple tables have their own directories (e.g., container_path/system/table1/, container_path/system/table2/, etc.).

  2. Each table is processed with structured streaming using Auto Loader and managed file events (cloudFiles.useManagedFileEvents = true).

  3. We are multithreading the processing, so each table runs in a separate thread, with each thread initializing its own stream for its specific table directory.
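To make the setup concrete, here is a minimal sketch of the per-table streaming function and the thread-per-table launcher described above. The base path, table names, source format, checkpoint layout, and staging table names are all illustrative placeholders, not the poster's actual values:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical base path and table list; adjust to your own layout.
BASE_PATH = "abfss://container@account.dfs.core.windows.net/system"
TABLES = ["table1", "table2"]

def source_path(table: str) -> str:
    """Input directory for one table's stream (container_path/system/<table>/)."""
    return f"{BASE_PATH}/{table}/"

def checkpoint_path(table: str) -> str:
    """Each stream must have its own checkpoint location."""
    return f"{BASE_PATH}/_checkpoints/{table}"

def stream_table(table: str):
    """Start one Auto Loader stream for one table.

    Assumes it runs in a Databricks notebook where `spark` is defined.
    """
    (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "parquet")             # source file format (assumption)
        .option("cloudFiles.useManagedFileEvents", "true")  # event-driven file discovery
        .load(source_path(table))
        .writeStream
        .option("checkpointLocation", checkpoint_path(table))
        .toTable(f"staging_{table}")                        # Delta staging table (placeholder name)
    )

def run_all(tables=TABLES):
    # One thread per table; each thread initializes an independent stream.
    with ThreadPoolExecutor(max_workers=len(tables)) as pool:
        list(pool.map(stream_table, tables))
```

The key points the sketch illustrates: each stream gets a distinct input directory, a distinct checkpoint, and a distinct sink, so the streams are fully independent of one another at the Structured Streaming level.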

My question:

Given this setup, will Databricks create a separate event queue for each table (i.e., each stream)? In other words, for each stream running for a specific table, will there be an independent event queue that listens for file events for that particular table directory? We are processing multiple tables simultaneously, so we need to ensure that each table's events are managed independently.

To summarize:

  • We are running one stream per table in parallel using multithreading.

  • Each stream listens to a specific directory (table) and writes data to a corresponding Delta staging table.

  • Managed file events are used to trigger the processing when new files are added.

Are multiple queues created automatically, one for each stream and corresponding table path, or does the system handle the event queuing in some other way?

Looking forward to your insights!

Thanks in advance!

3 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi @Sainath368 ,

File notification mode comes in two variants:

- File events (recommended) - a single file notification queue is shared by all streams that process files from a given external location.

- Legacy file notification mode - you manage a file notification queue for each Auto Loader stream separately. Auto Loader automatically sets up a notification service and queue service that subscribe to file events from the input directory.

So in your case, since you're using file events, all streams that process files from the same external location will share a single queue.

Here you can read more about it:

Configure Auto Loader streams in file notification mode - Azure Databricks | Microsoft Learn
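The two variants above are selected with different Auto Loader options. As a rough sketch (option names are from the Auto Loader docs; the `cloudFiles.format` value is just an example):

```python
# File events (recommended): streams reading from the same external
# location share one file notification queue managed by Databricks.
file_events_opts = {
    "cloudFiles.format": "parquet",             # source format (example value)
    "cloudFiles.useManagedFileEvents": "true",  # shared queue per external location
}

# Legacy file notification mode: Auto Loader provisions a separate
# notification service and queue for each individual stream.
legacy_notification_opts = {
    "cloudFiles.format": "parquet",
    "cloudFiles.useNotifications": "true",      # one queue per stream
}
```

Either dict can be applied to a reader with `.options(**file_events_opts)` before `.load(path)`.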


Raman_Unifeye
Contributor III

Yes, for your setup, Databricks Auto Loader will create a separate event queue for each independent stream running with the cloudFiles.useManagedFileEvents = true option.

You are running one stream per table, with a unique directory per stream and a unique checkpoint per stream.

The result is multiple, independent event queues, one tied to each stream/table path.


RG #Driving Business Outcomes with Data Intelligence

Hi @Raman_Unifeye @szymon_dybczak, considering the setup I already explained, does it make any difference whether I use managed file events or legacy file notification mode? If yes, in what aspects?
