Hi everyone,
We are currently migrating our Databricks Auto Loader pipelines from directory listing mode to managed file events for processing our data in structured streaming.
We have a function that handles the structured streaming: it reads data from a specific folder path (one per table) and writes it to a Delta staging table. Previously we polled directories for new files; now we're switching to managed file events for event-driven processing.
Our use case involves processing multiple tables concurrently: we use multithreading to run the streaming function for each table in parallel, one thread per table.
Here's how we're setting it up:
Multiple tables have their own directories (e.g., container_path/system/table1/, container_path/system/table2/, etc.).
Each table is processed with structured streaming using Auto Loader and managed file events (cloudFiles.useManagedFileEvents = true).
We are multithreading the processing, so each table runs in a separate thread, with each thread initializing its own stream for its specific table directory.
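For context, here is roughly how our setup looks. This is a simplified sketch: the container path, source format, table names, and the `staging` schema are placeholder examples, and `run_stream` assumes it runs on Databricks where a live `spark` session is passed in.

```python
# One Auto Loader stream per table, each launched from its own thread.
from concurrent.futures import ThreadPoolExecutor

BASE = "abfss://container@account.dfs.core.windows.net/system"  # example path
TABLES = ["table1", "table2", "table3"]  # example table names

def autoloader_options(table: str) -> dict:
    # Per-table Auto Loader options; each stream gets its own schema location.
    return {
        "cloudFiles.format": "json",                # example source format
        "cloudFiles.useManagedFileEvents": "true",  # managed file events mode
        "cloudFiles.schemaLocation": f"{BASE}/_schemas/{table}",
    }

def run_stream(spark, table: str):
    # Starts one stream for one table directory; runs on Databricks,
    # where `spark` is the active SparkSession.
    (spark.readStream
        .format("cloudFiles")
        .options(**autoloader_options(table))
        .load(f"{BASE}/{table}/")
        .writeStream
        .option("checkpointLocation", f"{BASE}/_checkpoints/{table}")
        .toTable(f"staging.{table}"))  # per-table Delta staging table

def run_all(spark):
    # One thread per table; each thread initializes its own stream.
    with ThreadPoolExecutor(max_workers=len(TABLES)) as pool:
        for table in TABLES:
            pool.submit(run_stream, spark, table)
```

Each stream has its own checkpoint and schema location, so the tables don't share any streaming state.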
My question:
Given this setup, will Databricks create a separate event queue for each table (i.e., each stream)? In other words, for each stream running for a specific table, will there be an independent event queue that listens for file events for that particular table directory? We are processing multiple tables simultaneously, so we need to ensure that each table's events are managed independently.
To summarize:
We are running one stream per table in parallel using multithreading.
Each stream listens to a specific directory (table) and writes data to a corresponding Delta staging table.
Managed file events are used to trigger the processing when new files are added.
Are multiple queues created automatically (one for each stream and its corresponding table path), or does the system handle the event queuing in some other way?
Looking forward to your insights!
Thanks in advance!