Data Engineering

File information is not passed to trigger job on file arrival

Rik
New Contributor III

We are using the UC mechanism for triggering jobs on file arrival, as described here: https://learn.microsoft.com/en-us/azure/databricks/workflows/jobs/file-arrival-triggers.

Unfortunately, the trigger doesn't actually pass the file path that generated it to the job (the Run Parameters are empty). Is there any way to get this information?

ACCEPTED SOLUTION

Tharun-Kumar
Databricks Employee

@Rik 

We have received this request from other customers as well. Our engineering team has been notified and there is an internal ticket tracking it, but we don't have an ETA at the moment.



Tharun-Kumar
Databricks Employee

@Rik 

For now, we do not send the file details as part of the trigger; the trigger is only used to run the pipeline.

Alternatively, you can use Auto Loader as part of the triggered pipeline to get the details of the file that arrived.
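
For example, a rough sketch (the paths, file format, and table name below are just placeholders): the job started by the file arrival trigger runs an Auto Loader stream over the monitored location and records which file each row came from via the _metadata column.

```python
# Rough sketch, with placeholder paths and table names: the triggered job uses
# Auto Loader to pick up whatever new files landed and captures their paths.
from pyspark.sql.functions import col

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_checkpoints/source_schema")
    .load("/Volumes/main/raw/landing/source/")
    # _metadata is populated by the file source; file_path identifies the arrived file
    .withColumn("source_file", col("_metadata.file_path"))
)

(
    df.writeStream
    .option("checkpointLocation", "/Volumes/main/raw/_checkpoints/source_sink")
    .trigger(availableNow=True)  # process everything new, then stop
    .toTable("main.raw.bronze_source")
)
```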

Rik
New Contributor III

"Alternately, You can use autoloader as part of the triggered pipeline to get the details of the file that arrived."

That doesn't quite fit our requirements unfortunately... Are there any plans on adding this functionality?

mattiazeni
Databricks Employee

What do you need to achieve?

Auto Loader is much more efficient since it handles a batch of files (only new ones) in a single operation. Handling files one by one, especially with a lot of files, will increase latency and cost.

What I wanted to achieve was dynamic schema application based on which file was picked up.
So I implemented one Auto Loader task to collect files from a specific path "source":
- source/employees/0001.csv
- source/holiday/0001.csv

If the path of the file were available, I could then apply the relevant schema at runtime.
But Auto Loader would presumably process both files and put them into the same DataFrame?
Maybe this isn't the best use case; I guess you would recommend implementing multiple tasks/checkpoints for the respective folders?
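
Roughly what I have in mind for that per-folder approach, as a sketch (the entity names, schemas, paths, and table names are just made up for illustration): one Auto Loader stream per entity folder, each with its own schema and checkpoint.

```python
# Illustrative sketch: one Auto Loader stream per entity folder, each with its own
# explicit schema and checkpoint. All names and paths are assumptions.
entities = {
    "employees": "id INT, name STRING, hired DATE",
    "holiday":   "id INT, holiday_date DATE, description STRING",
}

for entity, ddl_schema in entities.items():
    query = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .schema(ddl_schema)  # per-entity schema applied explicitly
        .load(f"/Volumes/main/raw/landing/source/{entity}/")
        .writeStream
        .option("checkpointLocation", f"/Volumes/main/raw/_checkpoints/{entity}")
        .trigger(availableNow=True)
        .toTable(f"main.raw.bronze_{entity}")
    )
    query.awaitTermination()  # let one entity finish before starting the next
```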

Tharun-Kumar
Databricks Employee

@Rik 

We have received this request from other customers as well. Our engineering team has been notified and there is an internal ticket tracking it, but we don't have an ETA at the moment.

Any ETA? We are having to use other orchestration products because of this limitation.

Panda
Valued Contributor

Could you please provide an update on the status of this particular request? Additionally, do we have any ETA for it?

marcuskw
Contributor II

This is also something I'm interested in using; it would be really helpful to use the file arrival trigger and get information about exactly which file triggered the workflow!

artemich
New Contributor II

Same here!

Additionally, it would be great to enhance it to support not just the path to a directory, but also a file-name prefix (or a regex for bonus points). Right now, if you have 10 types of files arriving in the same folder, it would be much cleaner to have each workflow that handles a given type process only the relevant files that arrived.

You can provide filter options to select only the relevant files:
https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/patterns.html#filtering-di...
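
For example, a rough sketch (the glob pattern and paths are placeholders): restricting an Auto Loader stream in a shared landing folder to one file type with pathGlobFilter.

```python
# Sketch: narrow an Auto Loader stream in a shared folder to one file type
# using pathGlobFilter. The glob pattern and paths are illustrative assumptions.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_checkpoints/employees_schema")
    .option("pathGlobFilter", "employees_*.csv")  # only files matching this name pattern
    .load("/Volumes/main/raw/landing/shared/")
)
```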

 

artemich
New Contributor II

For loading files with Auto Loader, for sure. My wish is to have a similar capability for the file arrival trigger.

"A file arrival trigger can be configured to monitor the root of a Unity Catalog external location or volume, or a subpath of an external location or volume."
https://learn.microsoft.com/en-us/azure/databricks/jobs/file-arrival-triggers

Quite often, files for multiple data entities (or even pipelines) land in the same directories from a given provider, and it would be great to be able to manage such scenarios easily.
