Hi @rohith_23 ,
These errors all relate to problems communicating with the Hive Metastore Service (HMS), the central component that stores metadata about your tables (schemas, table locations, column types, etc.).
The core of the issue in all three errors is a transport/network failure between the client (Spark job) and the HMS, specifically involving the Apache Thrift protocol that Hive uses for communication.
Since you mentioned that you are "facing this when there are lot of queries fired simultaneously," the likely cause is Metastore overload due to many concurrent requests (especially expensive ones, such as listing partitions on very large tables).
SocketException: Connection reset / Connection reset by peer is also typically seen when the Metastore is too busy to respond in time. (I do not suspect a crash, since it eventually recovers.)
Increasing the timeout may reduce these errors, since the client will wait longer for a response from the Metastore, giving it more time to process complex requests (e.g., listing many partitions). That said, while a longer socket timeout can mitigate the client-side symptom, it does not resolve underlying server resource limitations or query performance bottlenecks.
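If your Spark jobs talk to the HMS directly (i.e., not through a SQL warehouse), one knob worth trying is the Hive metastore client socket timeout. Here is a minimal sketch, assuming a standard Spark-with-Hive setup; the app name and the 300s value are illustrative, not recommendations:

```python
from pyspark.sql import SparkSession

# Sketch: raise the Hive metastore client socket timeout for this Spark app.
# hive.metastore.client.socket.timeout is a standard Hive config; recent Hive
# versions accept a time-unit suffix ("300s"), older ones expect plain seconds.
spark = (
    SparkSession.builder
    .appName("hms-timeout-example")  # hypothetical app name
    .config("spark.hadoop.hive.metastore.client.socket.timeout", "300s")
    .enableHiveSupport()
    .getOrCreate()
)
```

The `spark.hadoop.` prefix is how Spark forwards a setting into the underlying Hadoop/Hive configuration, so this applies only to this application rather than cluster-wide.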
I would suggest checking the warehouse's monitoring page to see whether the clusters were starting or stopping during this time. Look at the peak query count, the running queries, and their durations to get a better picture. You may need to size the warehouse according to your query concurrency requirements.
You can also try increasing the SocketTimeout value. For JDBC connections, explicitly set a longer SocketTimeout in the connection URL. For example:
jdbc:spark://<server-hostname>:443;HttpPath=<http-path>;TransportMode=http;SSL=1;SocketTimeout=300
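To see that URL in context, here is a minimal PySpark sketch of a JDBC read with the longer SocketTimeout set. The table name is hypothetical, and the driver class is an assumption matching the legacy Simba Spark JDBC driver (which uses the jdbc:spark:// prefix); adjust it for the driver you actually have on the classpath:

```python
# Assumes an existing SparkSession named `spark` (e.g., in a Databricks notebook).
# SocketTimeout in the URL is in seconds for this driver.
url = (
    "jdbc:spark://<server-hostname>:443;HttpPath=<http-path>;"
    "TransportMode=http;SSL=1;SocketTimeout=300"
)

df = (
    spark.read.format("jdbc")
    .option("url", url)
    .option("dbtable", "my_schema.my_table")          # hypothetical table
    .option("driver", "com.simba.spark.jdbc.Driver")  # assumed driver class
    .load()
)
df.show(5)
```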
Additionally, note that these configs are not supported on a SQL warehouse, as the [CONFIG_NOT_AVAILABLE] error you are seeing indicates.
Thanks!