<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Need to automatically rerun the failed jobs in databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/need-to-automatically-rerun-the-failed-jobs-in-databricks/m-p/89751#M37898</link>
    <description>&lt;P&gt;Thanks for the script.&lt;/P&gt;&lt;P&gt;But I need to write code within the existing notebook, and that code should filter and retrigger the notebook if it fails due to any timeout/server issues.&lt;/P&gt;&lt;P&gt;Can you please help me with the script?&lt;/P&gt;</description>
    <pubDate>Fri, 13 Sep 2024 09:49:14 GMT</pubDate>
    <dc:creator>Vishalakshi</dc:creator>
    <dc:date>2024-09-13T09:49:14Z</dc:date>
    <item>
      <title>Need to automatically rerun the failed jobs in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/need-to-automatically-rerun-the-failed-jobs-in-databricks/m-p/89074#M37680</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I need to retrigger failed jobs automatically in Databricks. Can you please share all the possible ways to do this?&lt;/P&gt;</description>
      <pubDate>Sun, 08 Sep 2024 14:53:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/need-to-automatically-rerun-the-failed-jobs-in-databricks/m-p/89074#M37680</guid>
      <dc:creator>Vishalakshi</dc:creator>
      <dc:date>2024-09-08T14:53:19Z</dc:date>
    </item>
    <item>
      <title>Re: Need to automatically rerun the failed jobs in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/need-to-automatically-rerun-the-failed-jobs-in-databricks/m-p/89077#M37683</link>
      <description>&lt;P&gt;Have a look at this link:&amp;nbsp;&lt;A href="https://docs.databricks.com/en/jobs/settings.html#retry-policies" target="_blank"&gt;https://docs.databricks.com/en/jobs/settings.html#retry-policies&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You can set a retry policy on tasks, or run the job in a loop that checks the status and re-runs it if it was not successful.&lt;/P&gt;</description>
      <pubDate>Sun, 08 Sep 2024 15:23:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/need-to-automatically-rerun-the-failed-jobs-in-databricks/m-p/89077#M37683</guid>
      <dc:creator>ranged_coop</dc:creator>
      <dc:date>2024-09-08T15:23:57Z</dc:date>
    </item>
    <item>
      <title>Re: Need to automatically rerun the failed jobs in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/need-to-automatically-rerun-the-failed-jobs-in-databricks/m-p/89088#M37686</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/47980"&gt;@ranged_coop&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;&lt;BR /&gt;To automatically retrigger failed jobs in Databricks within the last 24 hours, you can use the Databricks REST API to list the jobs, filter out the failed runs, and then retrigger those failed jobs. Below is a Python script that will help you achieve this.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import requests
import datetime

# Databricks configurations
DATABRICKS_HOST = "&amp;lt;your workspace url&amp;gt;"  # Replace with your Databricks workspace URL
DATABRICKS_TOKEN = "&amp;lt;your personal access token&amp;gt;"  # Replace with your Databricks Personal Access Token

# API endpoints
LIST_RUNS_ENDPOINT = f"{DATABRICKS_HOST}/api/2.1/jobs/runs/list"
RUN_NOW_ENDPOINT = f"{DATABRICKS_HOST}/api/2.1/jobs/run-now"

# Headers for API requests
HEADERS = {
    "Authorization": f"Bearer {DATABRICKS_TOKEN}"
}
# Timestamp for 24 hours ago, used to filter recent runs
twenty_four_hours_ago = datetime.datetime.now() - datetime.timedelta(hours=24)

def get_failed_runs():
    """
    Retrieve all failed job runs within the last 24 hours.
    """
    failed_runs = []
    has_more = True
    offset = 0
    
    while has_more:
        # Fetch job runs with pagination
        response = requests.get(LIST_RUNS_ENDPOINT, headers=HEADERS, params={"offset": offset, "limit": 25})
        
        # Check the HTTP status before parsing, in case the error body is not JSON
        if response.status_code != 200:
            print(f"Failed to fetch job runs: {response.text}")
            break
        
        data = response.json()
        if "runs" not in data:
            break
        
        for run in data["runs"]:
            # Keep runs that finished in a FAILED state within the last 24 hours
            run_end_time = datetime.datetime.fromtimestamp(run["end_time"] / 1000)
            if run.get("state", {}).get("result_state") == "FAILED" \
                    and run_end_time &amp;gt; twenty_four_hours_ago:
                failed_runs.append(run)
        
        # Check for more runs
        has_more = data.get("has_more", False)
        offset += 25  # Increment offset to fetch next set of runs
    
    return failed_runs
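
# If you only want to retry runs that failed for timeout/server reasons, one
# option is to filter on the run's state_message. The keywords below are
# illustrative assumptions -- print(run) and adjust them to the messages your
# workspace actually produces.
def is_transient_failure(run):
    """Return True when a failed run's state_message suggests a timeout or server-side issue."""
    message = run.get("state", {}).get("state_message", "").lower()
    transient_keywords = ["timeout", "timed out", "internal error", "server error"]
    return any(keyword in message for keyword in transient_keywords)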

def retrigger_failed_runs(failed_runs):
    """
    Retrigger all failed job runs.
    """
    for run in failed_runs:
        job_id = run["job_id"]
        print(f"Retriggering job ID: {job_id}, Run ID: {run['run_id']}")
        response = requests.post(RUN_NOW_ENDPOINT, headers=HEADERS, json={"job_id": job_id})
        
        if response.status_code == 200:
            print(f"Successfully retriggered job {job_id}.")
        else:
            print(f"Failed to retrigger job {job_id}: {response.text}")

# Main script execution
if __name__ == "__main__":
    failed_runs = get_failed_runs()
    if failed_runs:
        print(f"Found {len(failed_runs)} failed runs in the last 24 hours.")
        retrigger_failed_runs(failed_runs)
    else:
        print("No failed runs found in the last 24 hours.")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 08 Sep 2024 17:10:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/need-to-automatically-rerun-the-failed-jobs-in-databricks/m-p/89088#M37686</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2024-09-08T17:10:34Z</dc:date>
    </item>
    <item>
      <title>Re: Need to automatically rerun the failed jobs in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/need-to-automatically-rerun-the-failed-jobs-in-databricks/m-p/89750#M37897</link>
      <description>&lt;P&gt;I tried the retry logic, but I need to retrigger those jobs only if they fail due to timeout/server issues. Can you help me with this?&lt;/P&gt;</description>
      <pubDate>Fri, 13 Sep 2024 09:46:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/need-to-automatically-rerun-the-failed-jobs-in-databricks/m-p/89750#M37897</guid>
      <dc:creator>Vishalakshi</dc:creator>
      <dc:date>2024-09-13T09:46:26Z</dc:date>
    </item>
    <item>
      <title>Re: Need to automatically rerun the failed jobs in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/need-to-automatically-rerun-the-failed-jobs-in-databricks/m-p/89751#M37898</link>
      <description>&lt;P&gt;Thanks for the script.&lt;/P&gt;&lt;P&gt;But I need to write code within the existing notebook, and that code should filter and retrigger the notebook if it fails due to any timeout/server issues.&lt;/P&gt;&lt;P&gt;Can you please help me with the script?&lt;/P&gt;</description>
      <pubDate>Fri, 13 Sep 2024 09:49:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/need-to-automatically-rerun-the-failed-jobs-in-databricks/m-p/89751#M37898</guid>
      <dc:creator>Vishalakshi</dc:creator>
      <dc:date>2024-09-13T09:49:14Z</dc:date>
    </item>
    <item>
      <title>Re: Need to automatically rerun the failed jobs in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/need-to-automatically-rerun-the-failed-jobs-in-databricks/m-p/90760#M37990</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/119765"&gt;@Vishalakshi&lt;/a&gt;,&lt;BR /&gt;&lt;BR /&gt;I responded during the weekend, but it seems the responses were lost.&lt;BR /&gt;You have the run object here. Currently the criterion is to return only runs where run["state"]["result_state"] == "FAILED", i.e. all failed jobs.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="filipniziol_0-1726589554376.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11290i4C3E577A1D842B6F/image-size/medium?v=v2&amp;amp;px=400" role="button" title="filipniziol_0-1726589554376.png" alt="filipniziol_0-1726589554376.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;You can print(run) to see the object structure:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="filipniziol_1-1726589796071.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/11291i80B94B3B042A2B08/image-size/medium?v=v2&amp;amp;px=400" role="button" title="filipniziol_1-1726589796071.png" alt="filipniziol_1-1726589796071.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;For example, to rerun only the jobs where run["state"]["state_message"] contains "Task df_regular failed with message: Workload failed", the code would be:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;        for run in data["runs"]:
            # Keep failed runs from the last 24 hours whose state_message matches the filter
            run_end_time = datetime.datetime.fromtimestamp(run["end_time"] / 1000)
            if "state" in run and "result_state" in run["state"] and run["state"]["result_state"] == "FAILED" \
                and "Task df_regular failed with message: Workload failed" in run["state"]["state_message"] \
                and run_end_time &amp;gt; twenty_four_hours_ago:
                failed_runs.append(run)&lt;/LI-CODE&gt;&lt;P&gt;So I recommend printing the run object and building the filtering logic according to your criteria.&lt;/P&gt;&lt;P&gt;Hope it helps!&lt;/P&gt;</description>
      <pubDate>Tue, 17 Sep 2024 16:19:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/need-to-automatically-rerun-the-failed-jobs-in-databricks/m-p/90760#M37990</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2024-09-17T16:19:03Z</dc:date>
    </item>
  </channel>
</rss>

