Databricks SQL Warehouse Hung - Queries Stuck in Queued State & No Alerts Triggered
02-02-2025 07:33 AM
We have been facing critical challenges with Databricks SQL Warehouse for the last four weeks. We use a Databricks SQL Warehouse for ingestion from IICS, and we have observed the following issues:
- SQL Warehouse Going into a Hung State – The SQL Warehouse becomes completely unresponsive.
- All Queries Stuck in Queued State – None of the queries are processing, leading to severe workflow disruptions.
- No Alerts Triggered – Since the SQL Warehouse is hung, we do not receive any alerts, making it impossible to proactively respond.
- No Logs or Health Metrics Available – We do not have visibility into logs or any other SQL Warehouse health monitoring to diagnose the issue.
Questions & Help Needed:
- How can we monitor SQL Warehouse health in real-time?
- Are there any recommended best practices for debugging SQL Warehouse when it hangs?
- Is there a way to enable logging or diagnostics when the warehouse becomes unresponsive?
- Are there any settings in Databricks that can help us auto-recover from such failures?
This issue is severely impacting our workloads, and any guidance or solutions would be greatly appreciated.
02-02-2025 03:37 PM
Hey @sdheepak
The first thing you need to identify is the type of SQL Warehouse you are using in Databricks:
• Is it Serverless? If so, it is fully managed by Databricks, and you must contact Databricks support because you won’t have access to logs in your cloud provider.
• Is it Classic or Pro? In this case, you may be able to check logs in the EC2 instances (AWS) or virtual machines (Azure/GCP) within your cloud provider.
How can we monitor SQL Warehouse health in real-time? Navigate to Compute > SQL Warehouses, where you can check:
• Warehouse type (Serverless, Classic, Pro)
• Size and active status
• Autoscale settings
• Running queries, queued queries, query peaks, and completed queries
For historical queries, go to SQL > Query History and filter by cluster and date (up to 14 days of history).
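If you also want a check that runs outside the UI, something like the sketch below could be scheduled from a machine that does not depend on the warehouse itself. It assumes the SQL Warehouses REST endpoint `GET /api/2.0/sql/warehouses/{id}` and that its response carries a `state` field and a `health.status` field; the function names and the set of "unhealthy" values are my own illustration, so check them against the response you actually get.

```python
import json
import urllib.request

# Assumption: health statuses worth alerting on; adjust to what your workspace returns.
UNHEALTHY_STATUSES = {"DEGRADED", "FAILED"}


def fetch_warehouse(host: str, token: str, warehouse_id: str) -> dict:
    """Fetch one warehouse description (assumed endpoint, see lead-in)."""
    req = urllib.request.Request(
        f"{host}/api/2.0/sql/warehouses/{warehouse_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
    # A client-side timeout matters here: a hung service should fail the check,
    # not hang the monitor as well.
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def health_problems(warehouse: dict) -> list[str]:
    """Return human-readable problems found in a warehouse description."""
    problems = []
    state = warehouse.get("state", "UNKNOWN")
    # Note: an auto-stopped warehouse reports STOPPED, which may be normal for you.
    if state != "RUNNING":
        problems.append(f"state is {state}")
    status = warehouse.get("health", {}).get("status")
    if status in UNHEALTHY_STATUSES:
        problems.append(f"health status is {status}")
    return problems
```

You would call `fetch_warehouse(...)` on a schedule, pass the result to `health_problems(...)`, and send any non-empty result to whatever alerting channel you already use. A timeout or exception from the fetch is itself a signal worth alerting on.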
Are there any best practices for debugging SQL Warehouse when it hangs? If you are using Serverless, I strongly recommend switching to Classic mode. It is cheaper, allows better fine-tuning of infrastructure, and doesn’t autoscale as aggressively, meaning fewer Databricks Units (DBUs) and lower costs.
Is there a way to enable logging or diagnostics when the warehouse becomes unresponsive? It depends on what kind of “hung state” you are experiencing. If queries appear to be “running” indefinitely, check Query History to see if there are errors related to queries, connections, or processing failures.
If there are no visible logs or query failures, then you may need Databricks support to investigate deeper.
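To catch the "everything is queued" case before users notice, you could scan query history records for entries that have stayed in the queued state too long. The sketch below works on records you have already fetched (for example from the Query History API); the field names `status`, `query_start_time_ms`, and `query_id` are assumptions about that record shape, and the 300-second threshold is arbitrary.

```python
import time


def stuck_queued(queries, max_queued_seconds=300, now_ms=None):
    """Return IDs of queries that have been QUEUED longer than the threshold.

    `queries` is a list of dicts; the keys used here are assumed field names.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    stuck = []
    for q in queries:
        if q.get("status") != "QUEUED":
            continue
        start = q.get("query_start_time_ms")
        # Skip records without a start time rather than guessing an age.
        if start is not None and (now_ms - start) / 1000 > max_queued_seconds:
            stuck.append(q["query_id"])
    return stuck
```

If this returns many IDs at once while nothing is running, that is a reasonable trigger for an alert and a useful detail to include in a support ticket.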
Are there any settings in Databricks that can help us auto-recover from such failures? Databricks does not provide built-in auto-recovery for SQL Warehouses when queries are submitted from an external orchestrator.
The best solution is to implement a retry mechanism in your orchestrator or calling system, so that queries are retried automatically if no response is received.
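As a rough illustration of that retry idea, here is a minimal retry wrapper with exponential backoff. `submit` is a hypothetical callable that runs one query and raises on failure or timeout; it is not a Databricks or IICS API, and the attempt count and delays are placeholder values.

```python
import time


def run_with_retry(submit, max_attempts=4, base_delay=5.0, sleep=time.sleep):
    """Call `submit()` until it succeeds or `max_attempts` is reached.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except Exception as exc:  # narrow this to timeout/connection errors in practice
            last_error = exc
            if attempt == max_attempts:
                break
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"query failed after {max_attempts} attempts") from last_error
```

For this to help with a hung warehouse, `submit` needs its own client-side timeout; otherwise the first attempt simply blocks and the retry never fires.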
• Alternative Approach:
Instead of running queries externally, you can create a Databricks Workflow with multiple tasks and configure retry policies to reduce failures.
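For reference, a job definition along those lines might be built as below. This is only a sketch of a Jobs API payload: it assumes the task-level settings `max_retries`, `min_retry_interval_millis`, `retry_on_timeout`, and `timeout_seconds`, and a `sql_task` that points at a saved query by ID. The helper name, task keys, and numeric values are made up for illustration.

```python
def build_job(name, warehouse_id, query_ids, max_retries=3):
    """Build a job payload that runs saved queries in sequence, with retries."""
    tasks = []
    for i, query_id in enumerate(query_ids):
        task = {
            "task_key": f"query_{i}",
            "sql_task": {
                "warehouse_id": warehouse_id,
                "query": {"query_id": query_id},
            },
            "max_retries": max_retries,
            "min_retry_interval_millis": 60_000,  # wait one minute between attempts
            "retry_on_timeout": True,
            "timeout_seconds": 1800,  # fail the attempt instead of waiting forever
        }
        if i > 0:
            # Chain tasks so each query runs after the previous one succeeds.
            task["depends_on"] = [{"task_key": f"query_{i - 1}"}]
        tasks.append(task)
    return {"name": name, "tasks": tasks}
```

The `timeout_seconds` plus `retry_on_timeout` pair is what covers the stuck-in-queue case: an attempt that never starts is eventually timed out and retried rather than blocking the whole run.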
Hope that helps 🙂
