topic Re: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException in Data Engineering

org.apache.hadoop.hive.ql.metadata.HiveException: MetaException

rohith_23 — Mon, 22 Sep 2025 12:15:59 GMT

Hi Data Enthusiasts,

I have been facing a few errors in our SQL warehouse for quite a long time, and they happen pretty randomly.

We checked query runs and captured the errors below.
I believe this has something to do with Hive, and I am facing it when a lot of queries are fired simultaneously.

Thanks in advance! Any help is really appreciated!

Error 1: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: org.apache.thrift.transport.TTransportException java.net.SocketTimeoutException: Read timed out)

Error 2: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: org.apache.thrift.transport.TTransportException java.net.SocketException: Connection reset)

Error 3: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: org.apache.thrift.transport.TTransportException java.net.SocketException: Connection reset by peer)

#databricks #warehouse #hive

Re: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException

Khaja_Zaffer — Mon, 22 Sep 2025 12:50:10 GMT

Hello @rohith_23

Good day!!

Thank you for sharing the details.

These errors are typically related to connectivity issues between your Databricks SQL warehouse and the Hive Metastore (HMS), often triggered by high concurrency overwhelming the metastore's connection handling.

1. Increase Client Socket Timeout:

spark.hadoop.hive.metastore.client.socket.timeout 1800

2. Increase HMS Client Pool Size:

spark.databricks.hive.metastore.client.pool.size 32

https://community.databricks.com/t5/data-engineering/super-slow-sql-queries-on-an-hc-cluster/td-p/19257

3. Migrate to Unity Catalog (Long-Term Fix)

Hive Metastore is legacy and prone to these scalability issues. Switch to Unity Catalog (UC), which is Databricks' modern metadata layer: it is more reliable, supports fine-grained access control, and avoids HMS bottlenecks.

How to migrate to Unity Catalog:

https://docs.databricks.com/aws/en/data-governance/unity-catalog/migrate

I hope this helps.

Thank you.

Re: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException

rohith_23 — Tue, 23 Sep 2025 06:05:29 GMT

Hi @Khaja_Zaffer
Thank you for the quick response!
How can I tune these configurations on a SQL warehouse? I already tried.
It doesn't allow any tuning on a SQL warehouse, although I can set them on an all-purpose cluster.
Kindly help! Please find the error messages below.

[CONFIG_NOT_AVAILABLE] Configuration spark.hadoop.hive.metastore.client.socket.timeout is not available. SQLSTATE: 42K0I

[CONFIG_NOT_AVAILABLE] Configuration spark.databricks.hive.metastore.client.pool.size is not available. SQLSTATE: 42K0I

Re: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException

NandiniN — Fri, 26 Sep 2025 11:51:20 GMT

Hi @rohith_23 ,

These errors all relate to problems communicating with the Hive Metastore Service (HMS), the central component that stores metadata (schemas, table locations, column types, etc.) about your tables.

The core of the issue in all three errors is a transport/network failure between the client (Spark job) and the HMS, specifically involving the Apache Thrift protocol that Hive uses for communication.
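To make that concrete, here is a minimal Python sketch that pulls the transport exception and the underlying Java cause out of the three messages posted above. The regular expression and the function name classify are my own assumptions, based only on the message shape seen in this thread; other Hive errors may be formatted differently.

```python
import re

# Assumed message shape, taken from the three errors in this thread:
# ...MetaException(message:Got exception: <transport exception> <java cause>: <detail>)
PATTERN = re.compile(
    r"MetaException\(message:Got exception: "
    r"(?P<transport>[\w.]+) (?P<cause>[\w.]+): (?P<detail>[^)]*)\)"
)


def classify(error_line):
    """Return (transport, cause, detail), or None if the line has another shape."""
    match = PATTERN.search(error_line)
    if match is None:
        return None
    return match.group("transport"), match.group("cause"), match.group("detail")


PREFIX = "org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: "
THRIFT = "org.apache.thrift.transport.TTransportException "

# The three error messages from the original post.
errors = [
    PREFIX + THRIFT + "java.net.SocketTimeoutException: Read timed out)",
    PREFIX + THRIFT + "java.net.SocketException: Connection reset)",
    PREFIX + THRIFT + "java.net.SocketException: Connection reset by peer)",
]

for line in errors:
    print(classify(line))
```

All three lines report the same Thrift transport exception; only the socket-level cause differs (a read timeout versus a reset connection).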

As you mentioned, this happens when a lot of queries are fired simultaneously, so the likely cause is Metastore overload from many concurrent requests (especially expensive ones such as listing partitions on huge tables).

SocketException: Connection reset / Connection reset by peer is also seen when the Metastore is too busy to respond in time. (I do not suspect a crash, as it eventually recovers.)

Increasing the timeout may reduce these errors, because the client then waits longer for a response from the Metastore, which allows more time to process complex requests (e.g., listing many partitions). However, while a longer socket timeout can mitigate the client-side symptom, it does not resolve underlying server resource limitations or query performance bottlenecks.

I would suggest checking the monitoring page of the warehouse to see whether the clusters were starting or stopping during this time. Check the peak query count, the running queries, and their durations to get a better understanding. You may have to size the warehouse according to the query concurrency.
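If you export query start and end times, a rough way to estimate the peak concurrency yourself is a simple sweep over the start and end events. This is only a sketch: the function name peak_concurrency and the sample timestamps are made up for illustration and are not taken from your warehouse.

```python
from datetime import datetime


def peak_concurrency(intervals):
    """Return the maximum number of queries running at the same instant.

    intervals: iterable of (start, end) datetime pairs, one per query.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a query starts
        events.append((end, -1))    # a query finishes
    # Sort by time; at equal timestamps, finishes (-1) sort before starts (+1),
    # so a query ending exactly when another begins is not counted as overlap.
    events.sort(key=lambda event: (event[0], event[1]))

    running = 0
    peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


def at(hhmm):
    """Helper for the hypothetical sample data below."""
    return datetime.strptime("2025-09-22 " + hhmm, "%Y-%m-%d %H:%M")


# Hypothetical sample data, not real query history.
sample = [
    (at("12:00"), at("12:05")),
    (at("12:01"), at("12:03")),
    (at("12:02"), at("12:06")),
    (at("12:05"), at("12:07")),
]
print(peak_concurrency(sample))
```

Comparing this number with the times at which the errors occur should show whether the failures line up with concurrency spikes.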

You can also try increasing the SocketTimeout value: for JDBC connections, explicitly set a longer SocketTimeout in the connection URL. For example: jdbc:spark://<server-hostname>:443;HttpPath=<http-path>;TransportMode=http;SSL=1;SocketTimeout=300
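If you build the connection string in code, a small helper keeps the timeout in one place. This is a sketch only: build_jdbc_url is a hypothetical helper name, and the hostname and HTTP path are placeholders you need to replace with the values for your own warehouse.

```python
def build_jdbc_url(server_hostname, http_path, socket_timeout=300):
    """Assemble a JDBC URL in the same shape as the example in this reply.

    server_hostname and http_path are placeholders for your warehouse's values;
    socket_timeout is passed through unchanged as the SocketTimeout property.
    """
    if socket_timeout < 0:
        raise ValueError("socket_timeout must not be negative")
    return (
        f"jdbc:spark://{server_hostname}:443;"
        f"HttpPath={http_path};TransportMode=http;SSL=1;"
        f"SocketTimeout={socket_timeout}"
    )


# Placeholders kept as-is; substitute your own warehouse details.
print(build_jdbc_url("<server-hostname>", "<http-path>"))
```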

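Additionally, the two Spark configs suggested earlier (spark.hadoop.hive.metastore.client.socket.timeout and spark.databricks.hive.metastore.client.pool.size) are not supported on SQL warehouses, as the [CONFIG_NOT_AVAILABLE] error shows.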

Thanks!