Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Autoloader cluster

RK_AV
New Contributor III

I want to set up Auto Loader to automatically process files from Azure Data Lake (Blob) whenever new files arrive. For this to work, does Auto Loader require the cluster to be on all the time?

1 ACCEPTED SOLUTION

Kaniz_Fatma
Community Manager

Hi @Venkata Ramakrishna Alvakonda, if your data only arrives every hour or every day, for example, you can run Auto Loader in batch mode by specifying the "trigger once" option and then scheduling the notebook to run as a job.

In "trigger once" mode, Auto Loader still picks up files that arrived while no cluster was running. Nothing is processed until you rerun the Auto Loader code, manually or as part of a scheduled job, and that run then ingests everything that is new since the last checkpoint.
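
A minimal sketch of what such a "trigger once" run could look like in PySpark. The helper names (`autoloader_options`, `run_autoloader_once`) and every path or table name passed to them are hypothetical placeholders, not taken from this thread; `spark` is assumed to be the session that a Databricks notebook or job already provides.

```python
def autoloader_options(file_format, schema_location):
    """Build the cloudFiles options for an Auto Loader read (hypothetical helper)."""
    return {
        "cloudFiles.format": file_format,              # e.g. "json", "csv", "parquet"
        "cloudFiles.schemaLocation": schema_location,  # where the inferred schema is kept
    }


def run_autoloader_once(spark, source_path, checkpoint_path, target_table, options):
    """Ingest every file that arrived since the last run, then stop the stream."""
    stream = (
        spark.readStream
        .format("cloudFiles")
        .options(**options)
        .load(source_path)
    )
    query = (
        stream.writeStream
        # The checkpoint records which files were already ingested,
        # so the state survives while the cluster is off.
        .option("checkpointLocation", checkpoint_path)
        # Process what is available, then stop. Newer runtimes also
        # accept trigger(availableNow=True).
        .trigger(once=True)
        .toTable(target_table)
    )
    query.awaitTermination()
    return query
```

Scheduled as a job, the notebook would call something like `run_autoloader_once(spark, source_path, checkpoint_path, "bronze.events", autoloader_options("json", schema_path))` with your own locations, and the job cluster can shut down as soon as the call returns.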

6 REPLIES

Hubert-Dudek
Esteemed Contributor III

@Venkata Ramakrishna Alvakonda, no, it is not required. The last position is stored in the checkpoint. New files are either detected by directory listing or recorded in a queue.
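
To illustrate why the cluster does not have to stay on, here is a toy model of checkpoint-based discovery in plain Python. It is only an analogy under stated assumptions: the function `discover_new_files`, the JSON checkpoint file, and the demo directory layout are all invented for this sketch and are not how Auto Loader stores its state internally.

```python
import json
import tempfile
from pathlib import Path


def discover_new_files(source_dir, checkpoint_file):
    """List source_dir, return the files not yet in the checkpoint, and record them."""
    checkpoint = Path(checkpoint_file)
    seen = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    listed = {p.name for p in Path(source_dir).iterdir() if p.is_file()}
    new_files = sorted(listed - seen)
    # Persist the updated state so the next run starts where this one ended.
    checkpoint.write_text(json.dumps(sorted(seen | listed)))
    return new_files


def demo():
    """Simulate three runs with files arriving while nothing is running."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "landing"
        source.mkdir()
        checkpoint = Path(tmp) / "checkpoint.json"

        (source / "a.csv").write_text("1")
        first = discover_new_files(source, checkpoint)

        # "Cluster is down": two files arrive and nobody is watching.
        (source / "b.csv").write_text("2")
        (source / "c.csv").write_text("3")
        second = discover_new_files(source, checkpoint)

        # Nothing new arrived, so the third run finds nothing.
        third = discover_new_files(source, checkpoint)
        return first, second, third
```

Because the state lives in storage rather than in the running process, the second run picks up both files that arrived in the meantime, and the third run does not reprocess anything.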

RK_AV
New Contributor III

@Hubert Dudek, thank you for your response. My question was: does the cluster have to be on all the time to take advantage of Auto Loader? What happens if a file arrives in blob storage while the cluster is down? Does it automatically start the cluster and then invoke the Auto Loader process to read the file? Or does the file get picked up the next time the cluster starts?

RK_AV
New Contributor III

Thanks @Kaniz Fatma for the response. Unfortunately, I don't have a set frequency for the arrival of files; it is very ad hoc. Let me ask you this: is it possible for Event Grid to trigger a Databricks job?

Hubert-Dudek
Esteemed Contributor III

Sure, you can try Logic Apps: trigger when an event arrives in Event Grid, and then run the notebook.
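
Whatever reacts to the Event Grid event would ultimately have to ask Databricks to start a run. One way to do that, sketched here as an assumption rather than as the setup described in this thread, is a call to the Jobs API `run-now` endpoint. The helper name `build_run_now_request`, the workspace URL, the token, and the job ID are hypothetical placeholders.

```python
import json
import urllib.request


def build_run_now_request(workspace_url, token, job_id):
    """Build (but do not send) a POST to the Databricks Jobs API run-now endpoint."""
    payload = json.dumps({"job_id": job_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.1/jobs/run-now",
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",  # personal access token or AAD token
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending it is then a matter of `urllib.request.urlopen(build_run_now_request(url, token, job_id))` from the component that receives the event; a Logic App would issue the same HTTP request through its own HTTP action.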

asif5494
New Contributor III

@Kaniz Fatma, if my cluster is not active and I have uploaded 50 files to the storage location, where will Auto Loader list these 50 files? Will it use a checkpoint location? If yes, how can I set the checkpoint location in cloud storage so that these new files are identified? Can you please tell me the backend process that is used to identify these new files when my cluster is not active?
