Hi @Erik_L, I’m sorry to hear about the issues you’re facing with the Databricks scheduler. There are several possible reasons for a scheduled run failing at ~5:20 AM PT:
- If the job’s cluster is running out of resources (CPU, memory), runs can fail or hang. Monitoring resource usage in the cluster metrics and scaling the cluster appropriately (or enabling autoscaling) can help.
- Temporary network disruptions can affect the scheduler. Ensure that your network is stable and has low latency.
- Incorrect job configuration, such as timeout settings, retry policy, or task dependencies, can lead to failures. Double-check that these are set correctly (a sketch of the relevant settings follows after this list).
- Sometimes, the issue might be on Databricks’ end. Checking the Databricks status page for any ongoing incidents or maintenance can provide insights.
- Reviewing the run logs and error messages can provide specific details about why the run failed. Look for patterns or recurring errors; you can also pull the failure reason for recent runs programmatically (see the snippet at the end of this reply).
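Regarding the configuration bullet above, here is a minimal sketch of the settings that are most often involved, with placeholder values. Exact field placement differs between Jobs API 2.0 (job level) and 2.1 (per-task), so treat this as an illustration rather than a drop-in config:

```python
# Illustrative job settings (placeholder values) -- roughly what you would see
# in the job's JSON view or send to the Jobs API. Field placement differs
# between Jobs API 2.0 (job level) and 2.1 (per task), so adapt as needed.
job_settings = {
    "name": "nightly-etl",  # hypothetical job name
    "schedule": {
        "quartz_cron_expression": "0 20 5 * * ?",  # 05:20 every day
        "timezone_id": "America/Los_Angeles",
        "pause_status": "UNPAUSED",
    },
    "timeout_seconds": 3600,             # fail the run if it exceeds 1 hour
    "max_retries": 2,                    # retry transient failures automatically
    "min_retry_interval_millis": 60000,  # wait 1 minute between retries
    "retry_on_timeout": False,
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",  # example runtime; use your own
        "node_type_id": "i3.xlarge",          # example node type; use your own
        "autoscale": {"min_workers": 2, "max_workers": 8},  # headroom for load spikes
    },
    "email_notifications": {"on_failure": ["you@example.com"]},
}
```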
If reliability issues persist, evaluating alternative platforms may be a prudent step. Before making any decisions, though, it’s essential to thoroughly investigate and understand the root cause of the failures.
Is there any specific error message or log entry that you noticed that might help narrow down the issue?
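If it helps, the exact failure reason for recent runs is usually retrievable from the Jobs API. Below is a minimal sketch, assuming a placeholder workspace URL, personal access token, and job ID, that lists completed runs from the last 24 hours via the Jobs 2.1 `runs/list` endpoint and prints each run’s result state and state message:

```python
import requests
from datetime import datetime, timedelta, timezone

# Placeholders -- replace with your workspace URL, token, and job ID.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"
JOB_ID = 123456789

# Look at runs that started in the last 24 hours.
now = datetime.now(timezone.utc)
start_from = int((now - timedelta(hours=24)).timestamp() * 1000)  # epoch millis

resp = requests.get(
    f"{HOST}/api/2.1/jobs/runs/list",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={
        "job_id": JOB_ID,
        "start_time_from": start_from,
        "completed_only": "true",
        "limit": 25,
    },
)
resp.raise_for_status()

for run in resp.json().get("runs", []):
    state = run.get("state", {})
    started = datetime.fromtimestamp(run["start_time"] / 1000, tz=timezone.utc)
    print(
        f"run_id={run['run_id']} started={started:%Y-%m-%d %H:%M %Z} "
        f"result={state.get('result_state')} message={state.get('state_message')!r}"
    )
```

Running this after the ~5:20 AM window should surface the full state message for the failed run, which is often more informative than the summary shown in the jobs list.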
Thanks for sharing the image. It looks like a bar chart covering a timeline from July 17 to July 27, but the legend text is too small to read clearly.
To better assist you, could you share the specific error message or log entry you encountered? That will help in diagnosing the scheduler issue. Additionally, if there are any patterns or anomalies in the graph that you think might be related to the failures, please point them out.
Let’s work together to get to the bottom of this!