Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
How can we connect to 2 different hive spark.hadoop.hive.metastore.uris

maskepravin02
New Contributor II

We need to read a table from 2 different Hive metastores (spark.hadoop.hive.metastore.uris) and run some validations across them.

We are not able to connect to both metastore URIs at the same time from a single SparkSession.

I am using Spark 3.1.1 and the language is Scala.

Any suggestions are welcome.

 

1 ACCEPTED SOLUTION

Accepted Solutions

ashraf1395
Contributor

Hi there @maskepravin02,
We once implemented reading from two different Hive metastores, though it was not on AWS and GCP; maybe the docs below can help.

It is not a recommended setup, though.

The best approach is to create a separate Spark application per metastore, write each table out, and then orchestrate a join of the results.
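As a rough sketch of that separate-application pattern (the object name, arguments, and paths here are hypothetical, not from the original post): each application pins one metastore URI at session creation, stages its table to a neutral location, and a downstream step joins the staged copies.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical standalone exporter: run one copy per metastore via
// spark-submit, with args = <metastore-uri> <table> <output-path>.
object ExportTable {
  def main(args: Array[String]): Unit = {
    val Array(metastoreUri, tableName, outputPath) = args
    val spark = SparkSession.builder()
      .appName(s"export-$tableName")
      // The metastore URI is fixed here, at session creation time,
      // which avoids the unreliable runtime-switching problem entirely.
      .config("spark.hadoop.hive.metastore.uris", metastoreUri)
      .enableHiveSupport()
      .getOrCreate()
    // Stage the table somewhere both environments can read (S3/GCS/DBFS)
    spark.table(tableName).write.mode("overwrite").parquet(outputPath)
    spark.stop()
  }
}
```

A third application (or a downstream task in your orchestrator) then reads both Parquet outputs and performs the validation join.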

- Another option is dynamic switching, but it is quite error-prone, and I am not sure whether it works across AWS and GCP.
Here are the docs:
1. https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html

2. https://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties

3. https://stackoverflow.com/questions/32714396/querying-on-multiple-hive-stores-using-apache-spark

4. Some sample code (generated with GPT and Gemini, so please treat it as a sketch):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Dynamic Hive Metastore")
  .enableHiveSupport()
  .getOrCreate()

// Caveat: the Hive client is usually initialised with the metastore URI the
// session started with, so this runtime switch may silently keep talking to
// the old metastore on some Spark/Hive versions. Verify which metastore
// actually answers before trusting the results.
def switchMetastore(spark: SparkSession, metastoreUri: String): Unit = {
  // Set the Hive metastore URI dynamically
  spark.conf.set("spark.hadoop.hive.metastore.uris", metastoreUri)
  // Drop cached metadata so the catalog re-resolves the table
  spark.catalog.refreshTable("your_table")
}

// Example usage ("your_table" and both URIs are placeholders)
switchMetastore(spark, "thrift://aws-metastore-uri:9083")
val awsDf = spark.sql("SELECT * FROM your_table")
awsDf.show()

switchMetastore(spark, "thrift://gcp-metastore-uri:9083")
val gcpDf = spark.sql("SELECT * FROM your_table")
gcpDf.show()

spark.stop()

 Hope this helps you move forward.


3 REPLIES 3

Kaniz_Fatma
Community Manager

Hi @maskepravin02

  • How about creating a separate Spark session for each metastore URI? That way you can connect to the metastores independently.
  • Set the metastore URI dynamically based on the table you're accessing.
  • Instead of .config("hive.metastore.uris", ...), try using .config("spark.hadoop.hive.metastore.uris", ...) so the setting reaches the Hadoop configuration.
  • If you encounter any issues, feel free to ask for further assistance!
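One way to make the "separate sessions" idea work within a single application is to process the metastores sequentially, fully stopping the first session (and its SparkContext) before building the second. A hedged sketch with placeholder URIs, table names, and staging paths:

```scala
import org.apache.spark.sql.SparkSession

def readAndStage(metastoreUri: String, table: String, out: String): Unit = {
  val spark = SparkSession.builder()
    .appName("sequential-metastores")
    .config("spark.hadoop.hive.metastore.uris", metastoreUri)
    .enableHiveSupport()
    .getOrCreate()
  // Stage the table to a neutral location for the later comparison
  spark.table(table).write.mode("overwrite").parquet(out)
  // Stopping the session tears down the SparkContext, so the next
  // getOrCreate() builds a fresh one with the new metastore URI.
  spark.stop()
  // On some versions you may also need these before rebuilding:
  SparkSession.clearActiveSession()
  SparkSession.clearDefaultSession()
}

readAndStage("thrift://aws-metastore-uri:9083", "your_table", "/tmp/aws_copy")
readAndStage("thrift://gcp-metastore-uri:9083", "your_table", "/tmp/gcp_copy")
```

Restarting the context is heavyweight and not something to do per-query, but for a one-shot validation job it sidesteps the shared-context problem described below in this thread.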

maskepravin02
New Contributor II

@Kaniz_Fatma We have used spark.hadoop.hive.metastore.uris.

We created 2 Spark sessions in the same application with different Hive metastore URIs: the 1st for AWS with all the AWS properties and the 2nd for GCP with all the GCP connection properties.

However, both sessions end up pointing to the 1st metastore, even after the 2nd session is created.

It seems Spark internally creates only 1 SparkContext per application. Let me know if you have any sample code or other documentation regarding this.

Thanks in advance!
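For reference, a minimal illustration of why the second session inherits the first metastore (the URIs are placeholders): sessions created in one JVM share a SparkContext, and getOrCreate() returns a session backed by the existing context rather than building a new one.

```scala
import org.apache.spark.sql.SparkSession

// First session: the SparkContext (and the Hive client's static Hadoop
// config, including hive.metastore.uris) is created here.
val s1 = SparkSession.builder()
  .config("spark.hadoop.hive.metastore.uris", "thrift://aws-metastore-uri:9083")
  .enableHiveSupport()
  .getOrCreate()

// Second "session": getOrCreate() reuses the SAME context, so the GCP URI
// below does not replace the one the Hive client was initialised with.
val s2 = SparkSession.builder()
  .config("spark.hadoop.hive.metastore.uris", "thrift://gcp-metastore-uri:9083")
  .getOrCreate()
```

This is why the workarounds in this thread either stop the context between metastores or split the work into separate applications.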

