Hey @sdheepak
The first thing you need to identify is the type of SQL Warehouse you are using in Databricks:
• Is it Serverless? If so, the compute is fully managed by Databricks, so you won’t have access to instance logs in your cloud provider and will need to contact Databricks support.
• Is it Classic or Pro? In this case, you may be able to check logs in the EC2 instances (AWS) or virtual machines (Azure/GCP) within your cloud provider.
How can we monitor SQL Warehouse health in real-time?
You can monitor SQL Warehouse health by navigating to Compute > SQL Warehouses. There you can check:
Warehouse Type (Serverless, Classic, Pro)
Size & Active Status
Autoscale settings
Running Queries, Queued Queries, Query Peaks, and Completed Queries
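If you want the same information outside the UI, the warehouse can also be polled through the SQL Warehouses REST API. The snippet below is only a sketch: the workspace URL, token and warehouse ID are placeholders you would supply, the helper names are my own, and the response fields it reads (`state`, `health.status`, `warehouse_type`, `enable_serverless_compute`) are the ones I understand the API to return, so please check them against the API reference for your workspace.

```python
import json
import urllib.request


def summarize_warehouse(info: dict) -> dict:
    """Reduce a warehouse description to the fields worth alerting on."""
    health = info.get("health") or {}
    return {
        "name": info.get("name"),
        "state": info.get("state"),        # e.g. RUNNING, STARTING, STOPPED
        "health": health.get("status"),    # e.g. HEALTHY, DEGRADED, FAILED
        "type": info.get("warehouse_type"),
        "serverless": bool(info.get("enable_serverless_compute", False)),
    }


def fetch_warehouse(host: str, token: str, warehouse_id: str) -> dict:
    """Call GET /api/2.0/sql/warehouses/{id} and return the parsed JSON."""
    request = urllib.request.Request(
        f"{host.rstrip('/')}/api/2.0/sql/warehouses/{warehouse_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `summarize_warehouse(fetch_warehouse(host, token, warehouse_id))` from a scheduled script gives you something to alert on when the state or health changes.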
For historical queries:
Go to SQL > Query History. Filter by compute (the warehouse) and date range (up to 14 days of history).
Are there any best practices for debugging SQL Warehouse when it hangs?
If you are using Serverless, I strongly recommend switching to Classic mode. It is generally cheaper, gives you more control to fine-tune the infrastructure, and doesn’t autoscale as aggressively, which means fewer Databricks Units (DBUs) consumed and lower costs.
Is there a way to enable logging or diagnostics when the warehouse becomes unresponsive?
It depends on what kind of “hung state” you are experiencing. If queries appear to be “running” indefinitely, check Query History to see if there are errors related to queries, connections, or processing failures.
If there are no visible logs or query failures, then you may need Databricks support to investigate deeper.
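To spot queries that look stuck without clicking through the UI, you can apply a simple age check to query history records. This is a rough sketch: `find_stuck_queries` is a hypothetical helper, the 30-minute threshold is arbitrary, and the field names (`query_id`, `status`, `query_start_time_ms`) are what I understand the Query History API to return, so treat them as assumptions.

```python
import time


def find_stuck_queries(queries, max_runtime_s=1800, now_ms=None):
    """Return RUNNING or QUEUED entries that started more than max_runtime_s ago."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    stuck = []
    for query in queries:
        # Finished, failed and canceled queries are not candidates.
        if query.get("status") not in ("RUNNING", "QUEUED"):
            continue
        started_ms = query.get("query_start_time_ms")
        if started_ms is None:
            continue
        age_s = (now_ms - started_ms) / 1000
        if age_s > max_runtime_s:
            stuck.append(
                {"query_id": query.get("query_id"), "status": query["status"], "age_s": age_s}
            )
    return stuck
```

Anything this returns is a good candidate to include, with its query ID, in a support ticket.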
Are there any settings in Databricks that can help us auto-recover from such failures?
Databricks does not provide built-in auto-recovery for SQL Warehouses, so if you are running these queries from an external orchestrator, the recovery logic has to live on the orchestrator side.
The best solution is to implement a retry mechanism in your orchestrator/system, ensuring that queries automatically retry if no response is received.
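A retry wrapper along these lines is usually enough. It is a minimal sketch, not tied to any particular orchestrator: `run_with_retry` and its parameters are names I made up, and it assumes the `run_query` callable you pass in enforces its own timeout and raises `TimeoutError` (or `ConnectionError`) when no response arrives.

```python
import random
import time


def run_with_retry(run_query, max_attempts=4, base_delay=2.0, max_delay=60.0,
                   retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call run_query(), retrying with exponential backoff on retry_on errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Double the wait each time, capped at max_delay, plus some jitter
            # so parallel jobs do not all retry at the same moment.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

Only retry queries that are safe to run twice (idempotent reads, or writes designed to be re-run), otherwise a retry after a lost response can apply the same change twice.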
• Alternative Approach:
Instead of running queries externally, you can create a Databricks Workflow with multiple tasks and configure retry policies to reduce failures.
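For reference, this is roughly what the retry settings look like when the job is defined through the Jobs API rather than the UI. Treat it as an illustration: the job name, task keys, warehouse ID and query IDs are placeholders, `sql_task` is just a helper I wrote to avoid repetition, and the retry values are arbitrary. The per-task fields (`max_retries`, `min_retry_interval_millis`, `retry_on_timeout`, `timeout_seconds`) are the Jobs API task settings as I understand them.

```python
import json


def sql_task(task_key, warehouse_id, query_id, depends_on=()):
    """Build one SQL task definition with a retry policy attached."""
    return {
        "task_key": task_key,
        "depends_on": [{"task_key": key} for key in depends_on],
        "sql_task": {"warehouse_id": warehouse_id, "query": {"query_id": query_id}},
        "max_retries": 3,                     # retry a failed run up to 3 times
        "min_retry_interval_millis": 60_000,  # wait at least 1 minute between tries
        "retry_on_timeout": True,             # a timeout counts as a retryable failure
        "timeout_seconds": 1800,              # give up on a single attempt after 30 minutes
    }


job_settings = {
    "name": "nightly-sql-refresh",  # placeholder job name
    "tasks": [
        sql_task("load", "<warehouse-id>", "<load-query-id>"),
        sql_task("aggregate", "<warehouse-id>", "<aggregate-query-id>", depends_on=["load"]),
    ],
}

print(json.dumps(job_settings, indent=2))
```

The `timeout_seconds` setting matters here: without it, a query that hangs never fails, so the retry policy never gets a chance to kick in.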
Hope that helps 🙂