Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

File Arrival Trigger in Azure Databricks

ShresthaBaburam
New Contributor II

We are using Databricks together with Azure, specifically Azure Data Lake Storage Gen2 (ADLS Gen2). We frequently mount ADLS Gen2 containers in the Databricks file system and use external locations and volumes for them.

Our use case involves building several data pipelines in Databricks, and we are currently stuck on setting up a file arrival trigger. The goal is to trigger a workflow whenever a new file lands in an ADLS Gen2 container and to pass the complete file path to the next task in the workflow.

We would appreciate guidance on how to:

  1. Set up a file arrival trigger in Databricks for an ADLS Gen2 container.
  2. Capture the file path and file name that triggered the event and pass it as a parameter to the next task in the pipeline.

Any advice or best practices to solve this issue would be greatly appreciated!

Thank you for your time and assistance.

Best regards,
Baburam Shrestha

1 ACCEPTED SOLUTION

Panda
Contributor

@ShresthaBaburam 

We raised this question a few days ago and checked with Databricks; they were working on it, but no ETA was given. You can find more details here: Databricks Community Link.

However, to address this use case, we followed the steps below:

  1. Configure Auto Loader with directory listing mode, using trigger(availableNow=True).
  2. Capture the file path: use the _metadata column to capture the path of each newly arrived file:
    df_with_path = df.withColumn("input_file_path", col("_metadata.file_path"))
  3. Pass the file path to the next task: once the path is captured, hand it to the next task in the pipeline via task values or task parameters (see the sketches after this list).
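For steps 1 and 2, here is a minimal sketch of an Auto Loader stream in directory listing mode that captures each file's full path from the _metadata column. The storage paths, input format, and target table are hypothetical placeholders; adjust them to your environment (spark is predefined in Databricks notebooks).

    from pyspark.sql.functions import col

    # Hypothetical ADLS Gen2 paths - replace with your own.
    source_path = "abfss://landing@<storage_account>.dfs.core.windows.net/incoming/"
    checkpoint_path = "abfss://landing@<storage_account>.dfs.core.windows.net/_chk/incoming/"

    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")             # assumption: JSON input files
        .option("cloudFiles.useNotifications", "false")  # directory listing mode (the default)
        .load(source_path)
    )

    # Capture the full path of each newly arrived file from the _metadata column.
    df_with_path = df.withColumn("input_file_path", col("_metadata.file_path"))

    (
        df_with_path.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)   # process all pending files, then stop
        .toTable("bronze.incoming")   # hypothetical target table
    )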
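For step 3, one way to hand the captured paths to the next task in a Databricks job is task values. This is a sketch under stated assumptions: the upstream task name "ingest", the key "arrived_files", and the table name are all hypothetical.

    # In the ingestion task: collect the newly processed file paths and publish them.
    paths = [r["input_file_path"]
             for r in spark.table("bronze.incoming")
                           .select("input_file_path").distinct().collect()]
    dbutils.jobs.taskValues.set(key="arrived_files", value=paths)

    # In the downstream task: read the value set by the upstream task,
    # referencing it by its (hypothetical) task name "ingest".
    arrived = dbutils.jobs.taskValues.get(taskKey="ingest", key="arrived_files", default=[])
    for p in arrived:
        print(f"Processing {p}")

The downstream task can also receive the value as a task parameter via the dynamic value reference {{tasks.ingest.values.arrived_files}}. Keep in mind that task values must be JSON-serializable and are subject to a size limit.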

I hope this helps. 


