Data Engineering

Workflow scheduler cancel unreliable

Erik_L
Contributor II

Workflow parameters

Warning: 4m 30s | Timeout: 6m 50s

The jobs took 20-50 minutes to cancel.

This workflow must have high reliability for our requirements. Does anyone know why the scheduler failed this morning at ~5:20 AM PT?

After several failures, we're considering moving our workloads off Databricks to ensure reliability.

1 REPLY

Kaniz_Fatma
Community Manager

Hi @Erik_L, I’m sorry to hear about the issues you’re facing with the Databricks scheduler. There could be several reasons for the scheduler failure at ~5:20 AM PT.

  1. If your cluster is running out of resources (CPU, memory), it might cause the scheduler to fail. Monitoring resource usage and scaling your cluster appropriately can help.
  2. Temporary network disruptions can affect the scheduler. Ensure that your network is stable and has low latency.
  3. Incorrect job configurations, such as timeout settings or dependencies, might lead to failures. Double-check your job configurations to ensure they are set correctly.
  4. Sometimes, the issue might be on Databricks’ end. Checking the Databricks status page for any ongoing incidents or maintenance can provide insights.
  5. Reviewing the logs and error messages can provide specific details about why the scheduler failed. Look for any patterns or recurring errors.
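As a rough sketch of the review suggested in points 3 and 5, the snippet below converts the thresholds from the original post into seconds and flags runs that kept going after the timeout should have stopped them. It works on plain dictionaries; the field names `run_id`, `start_time` and `end_time` (epoch milliseconds, with `end_time` of 0 for an unfinished run) are assumptions modelled on Jobs API run records, so please check them against your own export. The sample records are hypothetical and are not taken from the workflow in question.

```python
import re

def parse_duration(text: str) -> int:
    """Convert a string such as '6m 50s' into a number of seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    parts = re.findall(r"(\d+)\s*([hms])", text)
    if not parts:
        raise ValueError(f"unrecognized duration: {text!r}")
    return sum(int(number) * units[unit] for number, unit in parts)

def runs_past_timeout(runs, timeout_s):
    """Return (run_id, seconds past the timeout) for each finished run
    whose wall-clock duration exceeded the configured timeout."""
    late = []
    for run in runs:
        start, end = run.get("start_time"), run.get("end_time")
        # Skip runs without usable timestamps (assumed: end_time is 0 or
        # missing while a run is still in progress).
        if start is None or not end:
            continue
        duration_s = (end - start) / 1000  # timestamps assumed to be epoch ms
        if duration_s > timeout_s:
            late.append((run["run_id"], round(duration_s - timeout_s)))
    return late

# Thresholds quoted in the original post.
WARNING_S = parse_duration("4m 30s")
TIMEOUT_S = parse_duration("6m 50s")

# Hypothetical records for illustration only.
sample_runs = [
    {"run_id": 1, "start_time": 1_000_000, "end_time": 1_300_000},  # 300 s
    {"run_id": 2, "start_time": 1_000_000, "end_time": 2_610_000},  # 1610 s
    {"run_id": 3, "start_time": 1_000_000, "end_time": 0},          # unfinished
]
overdue = runs_past_timeout(sample_runs, TIMEOUT_S)
print(overdue)
```

If the overdue runs cluster around a particular time of day, such as the ~5:20 AM PT window mentioned above, that pattern is worth including in a support ticket.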

If you continue to experience reliability issues, considering alternative platforms might be a prudent step. However, before making any decisions, it’s essential to thoroughly investigate and understand the root cause of the failures.

Is there any specific error message or log entry that you noticed that might help narrow down the issue?

Thanks for sharing the image. It looks like it contains a bar graph with data points over a timeline from July 17 to July 27. However, the text in the legend is too small to read clearly.

To better assist you, could you provide more details about the specific error message or log entry you encountered? This information will help in diagnosing the issue with the Databricks scheduler. Additionally, if there are any specific patterns or anomalies in the graph that you think might be related to the scheduler failure, please point them out.

Let’s work together to get to the bottom of this!
