Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Where in Hive Metastore can the s3 locations of Databricks tables be found?

Nino
Contributor

I have a few Databricks clusters. Some share a single Hive Metastore (HMS); call them PROD_CLUSTERS. There is also an additional cluster, ADHOC_CLUSTER, which has its own HMS.

All my data is stored in S3 as Databricks Delta tables: PROD_CLUSTERS have read-write access to those S3 buckets, and ADHOC_CLUSTER has read-only access.

I need to set up ADHOC_CLUSTER so that it has a set of views on top of all the tables (as s3 paths) defined by the PROD_CLUSTERS. 

For example, if one of the PROD_CLUSTERS created a table prod_schema.employees with LOCATION "s3://somebucket/employees", then in ADHOC_CLUSTER I would run (via a Databricks notebook):

CREATE VIEW live_views_prod_schema.employees 
as 
SELECT * FROM delta.`s3://somebucket/employees`

What I tried

My thinking was to have a script in ADHOC_CLUSTER run a SELECT against the HMS of the PROD_CLUSTERS (i.e., connect over JDBC to the MySQL database backing that HMS and query it directly), fetch all table names and S3 locations, and programmatically issue the necessary CREATE VIEW statements in ADHOC_CLUSTER.
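The view-generation half of that plan can be sketched in pure Python. This is only an illustration: `view_ddl` is a hypothetical helper, and the `live_views_` naming follows the example above; `rows` stands in for whatever the JDBC read would return.

```python
def view_ddl(db_name, tbl_name, location):
    """Build the CREATE VIEW statement for one prod table,
    using the live_views_<schema>.<table> convention from the example above."""
    return (
        f"CREATE VIEW live_views_{db_name}.{tbl_name} "
        f"AS SELECT * FROM delta.`{location}`"
    )

# Sample rows as they might come back from the JDBC read against the prod HMS
rows = [
    ("prod_schema", "employees", "s3://somebucket/employees"),
]

ddl_statements = [view_ddl(db, tbl, loc) for db, tbl, loc in rows]
# On ADHOC_CLUSTER each statement would then be executed with spark.sql(ddl)
```

The loop itself is trivial; the hard part, as described below, is getting correct locations out of the HMS in the first place.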

I imagined this would be simple, since in the HMS the SDS table holds the location (s3://somebucket/employees) in the SDS.location column for every Delta table created by the PROD_CLUSTERS (and some JSON-based tables). The problem is: in SDS I only find the correct location for some of the tables.
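The lookup I had in mind joins SDS to TBLS and DBS (column names as in the usual HMS backend schema; worth verifying against your HMS version). A self-contained illustration of that join, using an in-memory SQLite mock of the three tables:

```python
import sqlite3

# Minimal mock of the relevant HMS backend tables
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DBS  (DB_ID INTEGER, NAME TEXT);
CREATE TABLE TBLS (TBL_ID INTEGER, DB_ID INTEGER, SD_ID INTEGER, TBL_NAME TEXT);
CREATE TABLE SDS  (SD_ID INTEGER, LOCATION TEXT);
INSERT INTO DBS  VALUES (1, 'prod_schema');
INSERT INTO TBLS VALUES (10, 1, 100, 'employees');
INSERT INTO SDS  VALUES (100, 's3://somebucket/employees');
""")

# The SELECT one would run (over JDBC) against the MySQL-backed HMS
query = """
SELECT d.NAME AS db_name, t.TBL_NAME AS tbl_name, s.LOCATION AS location
FROM TBLS t
JOIN DBS d ON d.DB_ID = t.DB_ID
JOIN SDS s ON s.SD_ID = t.SD_ID
"""
rows = conn.execute(query).fetchall()
```

As the rest of this post explains, the catch is that SDS.LOCATION is not always the real data path for Delta tables.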

For other tables, the S3 path listed in the HMS's SDS.location points to the default bucket of the schema (even though the table was created in a different bucket), with a path like

s3a://<default-bucket>/prod_schema.db/employees-__PLACEHOLDER__

I couldn't get past this, nor find the actual data file locations anywhere else in the HMS.

My question

How can one query HMS to get the full path for all data files of tables defined in that HMS?

Thanks!

Accepted Solution

Nino
Contributor

Hi @Mo , 

Sorry for the gap, just back from vacation...

So yes, what you describe is possible, and quite close to what I'm doing now. 

The thing is, assuming this is not done manually, this requires:

  1. Setting up a process running all those "describe detail table_name" on my prod_cluster 
  2. Transferring all the outputs to the adhoc_cluster
  3. Running the CREATE VIEW scripts on adhoc_cluster
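For step 1, DESCRIBE DETAIL on a Delta table returns a row that includes `name` and `location` columns. A sketch of turning those rows into a transferable manifest (`to_manifest` is a hypothetical helper, and `describe_rows` only illustrates the shape of the collected output):

```python
def to_manifest(describe_rows):
    """Map DESCRIBE DETAIL rows (as dicts) to (table_name, location) pairs."""
    return [(r["name"], r["location"]) for r in describe_rows]

# Illustrative shape of rows collected on a prod cluster, e.g. via
# spark.sql("DESCRIBE DETAIL prod_schema.employees").collect()
describe_rows = [
    {"name": "prod_schema.employees", "location": "s3://somebucket/employees"},
]

manifest = to_manifest(describe_rows)
# manifest would then be shipped to adhoc_cluster for the CREATE VIEW loop
```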

This is somewhat cumbersome, especially as there is more than one prod_cluster. I simplified the design for the sake of clarity, but in prod, clusters are added from time to time, all using the same HMS. With the setup listed above, I would need to create and schedule those additional scripts on each new prod cluster.

This is doable, but involves setting up, and maintaining, processes on quite a few places.

It would be way simpler if I just had a single process running on adhoc_cluster, querying the HMS and creating the views.

Thanks for your suggestions!


8 Replies

Nino
Contributor

Thanks @Mo! If you are referring to https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-aux-describe-table.html, then yes, I am aware of it, but it will only work if I run it on the PROD_CLUSTERS, i.e. the clusters that have this HMS configured as their own.

It will not work as a query against the MySQL HMS, which is what I do from the ADHOC_CLUSTER: that cluster has a different HMS, and table_name is not defined there.

Nino
Contributor

To clarify: I connect over JDBC to the MySQL HMS to query the HMS tables directly, from an external source. I've clarified this in the post.

Nino
Contributor
Hey @Mo, thanks for the input!

Yes, points 1 & 2 are correct 🙂

Generally speaking, the solution you suggest is something like:

  1. Run "describe detail table_name" on one of the clusters in the PROD_CLUSTERS group, for all tables, generating a list of table_name & table_s3_location
  2. Copy/move this list to the ADHOC_CLUSTER, and iterate over the list to create the view for each table
Actually, as a workaround, I already have a similar solution: one job generates that list in prod, and a second job iterates over it to create the views in ADHOC_CLUSTER.

Both jobs are scheduled throughout the day, so as new tables are created in PROD, new views are generated in ADHOC.

This works fine but is somewhat cumbersome (as it always requires 2 jobs for each prod & adhoc pair, some syncing, etc.).

My intention was to simplify this architecture. 

The thing is, in step 1 above, the "describe" statement has to run on one of the PROD_CLUSTERS. That's what forces me to use 2 jobs: one listing the tables' definitions (running in prod) and one reading that list and generating the views (in ad-hoc).

The "describe detail table_name" statement cannot run in ADHOC_CLUSTER: this cluster uses a different HMS, in which table_name is not defined.

To make it clear, there are 2 different Hive Metastores:
  1. PROD_CLUSTERS - prod_HMS
  2. ADHOC_CLUSTER - adhoc_HMS
Hence my attempt to connect over JDBC to prod_HMS (directly to the MySQL instance) from a notebook running in ADHOC_CLUSTER, and run a SELECT that "imitates" the functionality of "describe detail table_name".

Sorry, the post was cut halfway through.


Nino
Contributor
Here's an example query I run from ADHOC_CLUSTER:

df_sds = (spark.read
    .format("jdbc")
    .option("driver", driver)
    .option("url", "jdbc:mysql://..../prod_hms")
    .option("dbtable", "(select SD_ID, CD_ID, location from SDS) a")
    .option("user", "prod_hms_mysql_user")
    .option("password", password)
    .load()
)

I expected "location" to be the S3 path, but it's not always so (elaborated in the original posting).
Thanks!

Hi @Retired_mod, do you mean this? I couldn't see that there's a GET_TABLE_FILES, and I'm not sure how to make it work; if you have a working example or some pointer, that would be great.

Thanks!
