Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

File Trigger Not Triggering Multiple Runs

Sneeze7432
New Contributor III

I have a job with one task, which runs a notebook. The job is set up with a file arrival trigger pointing at my blob storage location. The trigger works and the job starts when a new file arrives in the location, but it does not run once per file when multiple files arrive.

For example, I had three files uploaded at different times: the first at 3:57:03, the second at 3:57:07, and the last at 3:57:10. Three new files arrived, but only one job run was started. Why were three runs not queued?

13 REPLIES

nayan_wylde
Honored Contributor

Did you overwrite a file with the same name? Overwriting an existing file with a file of the same name does not trigger a run.

No, each file had a unique name.

nayan_wylde
Honored Contributor

Check whether you have configured these two options.

[Screenshot: the trigger's advanced settings, "Minimum time between triggers" and "Wait after last change"]

They are both set to 00h 00m.

szymon_dybczak
Esteemed Contributor III

Hi @Sneeze7432 ,

I think it could be caused by the following option, "Wait after last change in seconds". According to the documentation:

"The time to wait to trigger a run after file arrival. Another file arrival in this period resets the timer. This setting can be used when files arrive in batches, and the whole batch needs to be processed after all files have arrived."

An important thing to keep in mind is that "another file arrival in this period resets the timer". Put differently, if files arrive continuously, your workflow will never start, because its execution will keep being delayed. For that reason, this setting should only be used to optimize processing of files that arrive in batches.
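For reference, these two settings can also be managed through the Jobs API. Below is a minimal sketch using the Python databricks-sdk, assuming its jobs surface matches the Jobs 2.1 API fields; the job ID and storage URL are placeholders. Note that even with both values at zero (mirroring the 00h 00m in the UI), the platform still appears to coalesce near-simultaneous arrivals into one run, which is the behavior discussed in this thread.

```python
# Minimal sketch, assuming the databricks-sdk package (pip install databricks-sdk).
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import (
    FileArrivalTriggerConfiguration,
    JobSettings,
    TriggerSettings,
)

w = WorkspaceClient()  # reads host/token from env vars or ~/.databrickscfg

w.jobs.update(
    job_id=123456789,  # hypothetical job ID
    new_settings=JobSettings(
        trigger=TriggerSettings(
            file_arrival=FileArrivalTriggerConfiguration(
                # Hypothetical landing location.
                url="abfss://landing@myaccount.dfs.core.windows.net/incoming/",
                # Fire as soon as possible after a file lands ...
                min_time_between_triggers_seconds=0,
                # ... and don't wait for the batch to "settle".
                wait_after_last_change_seconds=0,
            )
        )
    ),
)
```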

I have the "Wait after last change" setting set to 00h 00m which I would assume means that immediately after a file drops in the storage location the job run will start.  I would also assume that means if I drop multiple files in the same location multiple jobs should start, and based on my concurrency limits some may have to be queued.

szymon_dybczak
Esteemed Contributor III

I'm just guessing, because unfortunately we don't have insight into how this was implemented, but it seems to me that the Databricks engineers treat files uploaded within a short time interval as a single batch, most likely for optimization purposes. If a trigger were generated every second, it wouldn't be a very efficient approach.
Even the option itself is specified in minutes, as if anything below that granularity is assumed to be a single batch.

[Screenshot: the trigger option, specified in hours and minutes]

What doesn't make sense is that the notification bar tells me "3 new files", but only one job runs. So even though it can display the number of new files between checks, it still starts only one run?

I don't know, it doesn't seem to be set up very well.
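Given that one triggered run can represent several new files, a defensive pattern is to make the notebook task itself batch-aware rather than assuming one run equals one file. A rough sketch, assuming it runs in a Databricks notebook (where spark and dbutils are predefined); the landing path and the ops.processed_files / bronze.incoming tables are hypothetical bookkeeping of my own, not anything the trigger provides:

```python
# Hypothetical landing folder watched by the file arrival trigger.
landing = "abfss://landing@myaccount.dfs.core.windows.net/incoming/"

# Hypothetical bookkeeping table of file paths already processed.
processed = {r.path for r in spark.table("ops.processed_files").collect()}

# Process every file this run hasn't seen, not just "the" new file.
new_files = [f.path for f in dbutils.fs.ls(landing) if f.path not in processed]

for path in new_files:
    df = spark.read.format("csv").option("header", "true").load(path)
    df.write.mode("append").saveAsTable("bronze.incoming")

# Record what this run handled so a later run skips it.
if new_files:
    spark.createDataFrame([(p,) for p in new_files], "path string") \
         .write.mode("append").saveAsTable("ops.processed_files")
```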

szymon_dybczak
Esteemed Contributor III

Maybe a Databricks employee will jump in and shed some light on the implementation details. But to me, treating really short intervals as one batch is quite a reasonable approach to avoid a massive number of triggers.

Same, I would really appreciate more details around this.

MariuszK
Valued Contributor III

It looks like the trigger processes files in batches, which means that each uploaded file doesn't create a new instance of the job.

  • Wait after last change in seconds: The time to wait to trigger a run after file arrival. Another file arrival in this period resets the timer. This setting can be used when files arrive in batches, and the whole batch needs to be processed after all files have arrived.

If you need to process files immediately or separately, you can experiment with the Auto Loader configuration, as sketched below.
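For example, a minimal Auto Loader sketch; the paths, schema location, and table name are placeholders. Auto Loader tracks discovered files in its checkpoint, so each file is processed exactly once regardless of how many trigger events fired:

```python
# Incrementally discover new files in the landing folder with Auto Loader.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation",
            "abfss://landing@myaccount.dfs.core.windows.net/_schemas/incoming/")
    .load("abfss://landing@myaccount.dfs.core.windows.net/incoming/")
)

(
    df.writeStream
    .option("checkpointLocation",
            "abfss://landing@myaccount.dfs.core.windows.net/_checkpoints/incoming/")
    .trigger(availableNow=True)  # drain everything that has arrived, then stop
    .toTable("bronze.incoming")
)
```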

 

nayan_wylde
Honored Contributor

[Screenshot: the job's "Maximum concurrent runs" setting]

@Sneeze7432 you can also try editing the maximum concurrent runs on the workflow.

That doesn't solve the problem of runs not queueing. It could actually make things worse: I could have multiple jobs writing to the same location and potentially overwriting each other, creating inaccurate data.
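For completeness, a sketch of what nayan_wylde suggests, again assuming the databricks-sdk surface and a placeholder job ID. Enabling the job queue at least addresses the queueing side: once runs do get created and the concurrency limit is hit, they wait instead of being skipped. It doesn't change how arrivals are coalesced into a single trigger, though.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.jobs import JobSettings, QueueSettings

w = WorkspaceClient()
w.jobs.update(
    job_id=123456789,  # hypothetical job ID
    new_settings=JobSettings(
        max_concurrent_runs=5,          # allow parallel runs of this job
        queue=QueueSettings(enabled=True),  # queue runs beyond the limit
    ),
)
```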
