We work with an application that connects to our Databricks serverless SQL warehouse via the Databricks JDBC driver. It runs a few thousand SQL SELECT statements every day, and a small percentage of them fail with the following error:
java.sql.SQLException: [Databricks][JDBCDriver](500157) Socket timeout expired when executing query: SELECT COA***.; caused by com.databricks.client.support.exceptions.GeneralException: [Databricks][JDBCDriver](500157) Socket timeout expired when executing query: SELECT COA***.; caused by com.databricks.client.jdbc42.internal.apache.thrift.transport.TTransportException: java.net.SocketTimeoutException: Read timed out; caused by java.net.SocketTimeoutException: Read timed out
A full stack trace can be provided upon request.
On the SQL warehouse side, the failed queries report the error "Query has been timed out due to inactivity." Reviewing the query profile, the biggest standout metric is "Result fetching by client", which ranges from 6 to 20 minutes on the failed queries, compared to the usual ~225 ms on successful ones.
Observations about the failed statements:
- The statements typically fail after about 10 seconds, though several failed after 15 or 16 seconds.
- Execution never reaches the fetching phase.
- Failed statements continue executing on the server side. They eventually fail there as well, but it can take 10 minutes or more; that time is reported as fetching time.
- When a failed statement is re-run from the SQL console, it passes and the data is fetched quickly, since most of the queries return only a small number of records.
- Failures happen during high load. The serverless SQL warehouse is at its maximum configuration, i.e. it does not scale up or down during the failures. Switching to more powerful nodes has no effect.
JDBC driver configuration:
- Driver version 2.6.38; the issue also occurs with 2.6.36
- Authentication by personal access token (PAT)
- SocketTimeout set to either 0 or 300; failures occur with both values
- Preset catalog and schema
- Connection initialized with "SET ansi_mode = false"
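For reference, this is roughly how we assemble the JDBC URL (everything in angle brackets is a placeholder; the property names follow the Databricks JDBC driver documentation):

```java
// Sketch of the JDBC URL passed to the connection pool; values in angle brackets are placeholders.
String jdbcUrl = "jdbc:databricks://<workspace-host>:443/default"
        + ";transportMode=http;ssl=1"
        + ";httpPath=/sql/1.0/warehouses/<warehouse-id>"
        + ";AuthMech=3;UID=token;PWD=<personal-access-token>"  // PAT authentication
        + ";ConnCatalog=<catalog>;ConnSchema=<schema>"         // preset catalog and schema
        + ";SocketTimeout=300";                                // also tried 0 (no timeout)
```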
Connection pool settings:
- HikariCP version 5.1.0
- A maximum of 30 parallel connections to the SQL warehouse
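And a minimal sketch of the pool setup (the class and method names here are illustrative, not our actual code):

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

class WarehousePool {
    static HikariDataSource create(String jdbcUrl) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(jdbcUrl);                              // URL built as sketched above
        config.setDriverClassName("com.databricks.client.jdbc.Driver");
        config.setMaximumPoolSize(30);                           // at most 30 parallel connections
        config.setConnectionInitSql("SET ansi_mode = false");    // run on every new connection
        return new HikariDataSource(config);
    }
}
```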
We're looking for any advice on how to mitigate the failures.