
stop autoloader with continuous trigger programmatically

sanjay
Valued Contributor II

Hi,

I am running Auto Loader with a continuous trigger. How can I stop this trigger at a specific time, but only if no data is pending and the current batch has finished? And how can I check how many records are pending in the queue, and the stream's current state?

Regards,

Sanjay

1 ACCEPTED SOLUTION

Kaniz_Fatma
Community Manager

Hi @sanjay, here is how you can manage an Auto Loader stream running with a continuous trigger:

Pausing the Trigger at Specific Times: If you need to halt ingestion during certain hours, the cleanest option is to switch to a triggered pipeline. If you prefer to keep the continuous trigger, you can stop the stream programmatically through the workspace client or the Jobs REST API. Handle this carefully to avoid data loss, since cancelling a run can interrupt a micro-batch in flight.
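For example, here is a minimal sketch using the Databricks Python SDK (databricks-sdk), assuming the stream runs as a job whose ID you know; the JOB_ID below is a placeholder:

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up auth from env vars or ~/.databrickscfg

JOB_ID = 123  # placeholder: the job that runs the Auto Loader notebook

# Cancel every active run of the job. Note that cancellation can interrupt
# a micro-batch in flight; checkpointing lets the stream recover, but a
# graceful in-notebook stop (see the next sketch) is safer.
for run in w.jobs.list_runs(job_id=JOB_ID, active_only=True):
    w.jobs.cancel_run(run_id=run.run_id)
```

You could run this from a small scheduled job so the stream is taken down at a fixed time.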

 

Checking Pending Records and Current State: There isn't a direct API that returns an exact count of records still waiting, but you can get close by monitoring the Structured Streaming query that backs your Auto Loader job: its progress reports and checkpoint files show what has been ingested and what is still outstanding. Also note that if you change the trigger while a micro-batch is running, the change only takes effect once that batch completes, so an active trigger is itself a signal that work is still in flight. Auto Loader is built on Structured Streaming, so understanding how triggers work gives you better control over costs and ingestion behavior.
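To make that concrete, here is a sketch of a graceful in-notebook stop. It polls the streaming query's progress, where Auto Loader reports backlog metrics; the numFilesOutstanding field name is taken from Databricks' streaming metrics and should be verified on your runtime version:

```python
import time

# `query` is the StreamingQuery handle returned by your Auto Loader
# writeStream (e.g. by .start() or .toTable()).

def stop_when_idle(query, poll_seconds=60):
    """Stop the stream once no files are pending and no batch is running."""
    while query.isActive:
        progress = query.lastProgress  # most recent micro-batch, as a dict
        if progress:
            # Auto Loader surfaces backlog metrics on the source.
            metrics = progress["sources"][0].get("metrics", {})
            pending_files = int(metrics.get("numFilesOutstanding", 0))
            if pending_files == 0 and not query.status["isTriggerActive"]:
                query.stop()              # queue drained, no batch in flight
                query.awaitTermination()
                break
        time.sleep(poll_seconds)
```

This satisfies both of your conditions: the stream is only stopped when the current batch is complete and nothing is left in the queue.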


4 REPLIES

melbourne
Contributor

You can switch to a 'Triggered' pipeline in this case.

Next, create a job in Workflows and attach a trigger of type 'file arrival' to it. Then add the notebook and cluster to the job. If you're not using DLT, set the cluster's auto-termination timeout to 0 minutes so the cluster shuts down as soon as it's idle.

Now, whenever a file arrives in your landing location, the trigger fires and starts the cluster, which runs the notebook until the task finishes.
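If you go this route, the notebook itself can use an availableNow trigger (Spark 3.3+ / DBR 10.4+) so each file-arrival run processes everything pending and then exits, letting the cluster shut down. A minimal sketch with placeholder paths and table name:

```python
# Paths, format and table name below are placeholders; adjust to your setup.
(spark.readStream
      .format("cloudFiles")                       # Auto Loader source
      .option("cloudFiles.format", "json")
      .option("cloudFiles.schemaLocation", "/mnt/chk/schema")
      .load("/mnt/landing")                       # the file-arrival location
 .writeStream
      .option("checkpointLocation", "/mnt/chk/bronze")
      .trigger(availableNow=True)                 # drain the backlog, then stop
      .toTable("bronze_events"))
```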

sanjay
Valued Contributor II

Thank you melbourne. I can't switch to a triggered pipeline for now. Is it possible to stop/pause the stream using the workspace client or the Jobs REST API?

Thanks,

Sanjay

RamonaMraz
New Contributor II

Hello, I am new here. Can I ask a question?
