Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Kill/Cancel a Notebook Cell Running Too Long on an All-purpose Cluster

zenwanderer
New Contributor

Hi everyone, I’m facing an issue when running a notebook on a Databricks All-purpose cluster. Some of my cells/pipelines run for a very long time, and I want to automatically cancel/kill them when they exceed a certain time limit.

I tried setting spark.databricks.execution.timeout, but it doesn’t seem to have any effect in my case.

What I need is a timeout mechanism that can cancel the currently running notebook cell, not just a Spark job timeout.

If anyone can share guidance or official documentation references, I’d really appreciate it. Thanks in advance!

4 REPLIES

balajij8
Contributor
  • You can use Python's signal module to enforce a per-cell timeout when running interactively in a notebook:
# Add once in the notebook
import signal

class TimeoutException(Exception):
    """Raised when a cell runs longer than the allowed time."""

def timeout_handler(signum, frame):
    raise TimeoutException("Timed out!")

def set_cell_timeout(seconds):
    # SIGALRM fires after `seconds`; works only on the driver's main thread (Unix)
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(seconds)

# Wrap the long-running work in the notebook cell
try:
    set_cell_timeout(30)  # set a 30-second limit
    # notebook function
finally:
    signal.alarm(0)  # always clear the pending alarm
  • For scheduled runs, use Lakeflow Jobs notifications with a duration threshold to cancel jobs that run too long. Avoid the signal approach in notebooks executed via Jobs, where the handler may not run on the main thread.

What I’m looking for is a workspace-level monitoring approach: detect any notebook execution where a cell (or the run) has been running longer than a threshold, and then cancel/terminate it automatically.

I’ve tried looking into the audit tables and the REST APIs, but they don’t seem to provide enough cell-level visibility.
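For job runs (though not for interactive cells on an all-purpose cluster), one workspace-level pattern is a small scheduled watchdog that lists active runs via the Jobs API 2.1 (`/api/2.1/jobs/runs/list`) and cancels any that exceed a threshold (`/api/2.1/jobs/runs/cancel`). A sketch of the selection logic, assuming the documented response shape (`run_id`, `start_time` in epoch milliseconds, `state.life_cycle_state`):

```python
import time

def overdue_run_ids(runs, max_duration_s, now_ms=None):
    """Return run_ids of RUNNING runs older than max_duration_s.

    `runs` is shaped like the `runs` array in the Jobs API 2.1
    /api/2.1/jobs/runs/list response.
    """
    now_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    overdue = []
    for run in runs:
        if run.get("state", {}).get("life_cycle_state") != "RUNNING":
            continue
        # start_time is in epoch milliseconds
        if (now_ms - run["start_time"]) / 1000 > max_duration_s:
            overdue.append(run["run_id"])
    return overdue

# Each overdue run_id would then be POSTed to /api/2.1/jobs/runs/cancel.
```

This covers job runs only; for cells typed interactively into a notebook attached to an all-purpose cluster, the REST APIs do not (to my knowledge) expose per-cell execution state, which matches what you observed.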

balajij8
Contributor

For the issue "Some of my cells/pipelines run for a very long time, and I want to automatically cancel/kill them when they exceed a certain time limit":

  • You can use job notifications with a metric threshold: Duration Warning sends a notification, and Duration Timeout cancels jobs whose run time exceeds it. More details here
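In Jobs API 2.1 terms, the warning corresponds to a job health rule on run duration and the hard kill to `timeout_seconds`. A sketch of the relevant fragment of a job spec (field names as documented for health rules; values are examples):

```python
def job_duration_settings(warn_after_s, kill_after_s):
    """Fragment of a Jobs API 2.1 job spec enforcing duration limits."""
    return {
        "health": {
            "rules": [
                # Fires the Duration Warning notification
                {
                    "metric": "RUN_DURATION_SECONDS",
                    "op": "GREATER_THAN",
                    "value": warn_after_s,
                },
            ]
        },
        # Hard limit: the run is stopped once it exceeds this
        "timeout_seconds": kill_after_s,
    }
```

These settings can also be configured from the job's UI under notifications/timeout rather than via the API.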

MoJaMa
Databricks Employee

@zenwanderer Have you looked into Query Watchdog?

For Classic All-Purpose clusters this might be your best bet.

https://docs.databricks.com/aws/en/compute/troubleshooting/query-watchdog
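Query Watchdog targets runaway queries (those producing far more output rows than input rows) rather than arbitrary long cells, so check the linked docs for whether it fits your workloads. A sketch of the Spark confs it uses, with example values (defaults and exact semantics are in the docs):

```python
def query_watchdog_conf(output_ratio=1000, min_time_secs=10,
                        min_output_rows=100_000):
    """Spark confs documented for Query Watchdog (example values)."""
    return {
        "spark.databricks.queryWatchdog.enabled": "true",
        # Kill queries whose output/input row ratio exceeds this
        "spark.databricks.queryWatchdog.outputRatioThreshold": str(output_ratio),
        # Only consider queries running at least this long
        "spark.databricks.queryWatchdog.minTimeSecs": str(min_time_secs),
        # ...and producing at least this many output rows
        "spark.databricks.queryWatchdog.minOutputRows": str(min_output_rows),
    }

# In a notebook on the cluster:
# for k, v in query_watchdog_conf().items():
#     spark.conf.set(k, v)
```

These can also be set once in the cluster's Spark config so every notebook on the all-purpose cluster is covered.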