Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

API for Restarting Individual Failed Tasks within a Job?

minhhung0507
Valued Contributor

Hi everyone,

I'm exploring ways to streamline my workflow in Databricks and could really use some expert advice. In my current setup, I have a job (named job_silver) with multiple tasks (e.g., task 1, task 2, task 3). When one of these tasks fails—say task 2—I want the ability to restart just that specific task without rerunning the entire job.

I did some research and came across the “Repair and Rerun” feature (Databricks Blog). While that's a great tool for saving time and money in data and ML workflows, my use case requires more flexibility. Specifically, I'm looking for an API-based solution that I can integrate into my code, allowing dynamic control over which task to restart based on custom logic.

Some points I’m particularly interested in:

  1. Is there an existing API (or a combination of APIs) that allows for restarting individual tasks within a job?

  2. Could this be done via the REST API, and if so, what endpoints or methods should I look at?

  3. Are there any workarounds or best practices for implementing this functionality if a dedicated API is not available?

  4. How might this approach scale in environments with a large number of jobs and complex dependency graphs?

I’d love to hear about your experiences and any code snippets or documentation pointers that could help me get started. Thanks in advance for your insights!

Regards,
Hung Nguyen
15 REPLIES

Hey @minhhung0507 
Great questions, let me answer as per my understanding:

Q1: Why do we need to use an interactive (all‑purpose) cluster when submitting a job, rather than a job cluster?

Since I was just testing, it was purely a matter of convenience, nothing more: I didn't want to wait for a job cluster to spin up, so I used an interactive cluster. Once I figured out that you're using a job cluster to pass the request, I suggested using a different parameter in the JSON payload.

Q2: I can’t find any reference to a parameter called job_cluster_details in the official docs—could you point me to where it’s documented or share a link?

I was checking the REST API docs and, in the link below, saw how the job cluster is referenced. I suggested the same to you.

[Screenshot: RiyazAli_0-1745493412644.png — example `job_clusters` entry from the Jobs Get API response]

In the example above, "auto_scaling_cluster" is the name of the job cluster.
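To make the screenshot's point concrete, here is a minimal sketch of how a job's settings reference a shared job cluster, modeled on the Jobs Get API response shape. The cluster spec values (Spark version, node type, autoscale range) are illustrative placeholders, not recommendations:

```python
# Sketch of a job settings payload with a named job cluster.
# "auto_scaling_cluster" is the job cluster's name (its job_cluster_key);
# each task points at the shared cluster via that same key.
job_settings = {
    "job_clusters": [
        {
            "job_cluster_key": "auto_scaling_cluster",
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",  # illustrative runtime
                "node_type_id": "i3.xlarge",          # illustrative node type
                "autoscale": {"min_workers": 1, "max_workers": 4},
            },
        }
    ],
    "tasks": [
        {
            "task_key": "task2",
            # The task runs on the shared cluster named above:
            "job_cluster_key": "auto_scaling_cluster",
        }
    ],
}
```

The key thing is that `job_cluster_key` in a task must match the `job_cluster_key` declared under `job_clusters`.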

Link to the doc - https://docs.databricks.com/api/workspace/jobs/get

Also, check the Repair Run API; I believe this is the right endpoint for your use case.
https://docs.databricks.com/api/workspace/jobs/repairrun
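To rerun just the failed task from code, a stdlib-only sketch against that Repair Run endpoint could look like the following. The `rerun_tasks` field lists the `task_key`s to repair, so succeeded tasks are left alone. The host, token, run id, and task key here are placeholders you'd supply from your own workspace:

```python
import json
import os
import urllib.request


def build_repair_payload(run_id: int, failed_task_keys: list[str]) -> dict:
    """Build the request body for POST /api/2.1/jobs/runs/repair.

    rerun_tasks names the task_keys to rerun within the existing job run.
    """
    return {"run_id": run_id, "rerun_tasks": failed_task_keys}


def repair_run(host: str, token: str, run_id: int,
               failed_task_keys: list[str]) -> dict:
    """Call the Jobs Repair Run endpoint and return its JSON response."""
    payload = build_repair_payload(run_id, failed_task_keys)
    req = urllib.request.Request(
        f"{host}/api/2.1/jobs/runs/repair",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Placeholder values -- substitute your workspace URL, run id, and task key.
    host = os.environ.get("DATABRICKS_HOST",
                          "https://<your-workspace>.cloud.databricks.com")
    token = os.environ.get("DATABRICKS_TOKEN", "")
    if token:
        print(repair_run(host, token, run_id=123456,
                         failed_task_keys=["task2"]))
```

To decide dynamically which tasks to pass, you could first fetch the run (the Get a single run endpoint) and collect the `task_key`s whose result state indicates failure, then feed those into `rerun_tasks`.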

Let me know your thoughts.

Riz