08-09-2023 10:34 PM - edited 08-10-2023 09:02 AM
I have a few Databricks clusters, some share a single Hive Metastore (HMS), call them PROD_CLUSTERS, and an additional cluster, ADHOC_CLUSTER, which has its own HMS.
All my data is stored in S3, as Databricks delta tables: PROD_CLUSTERS have read-write on those s3 buckets, and ADHOC_CLUSTER has only read privileges.
I need to set up ADHOC_CLUSTER so that it has a set of views on top of all the tables (as s3 paths) defined by the PROD_CLUSTERS.
For example, if one of the PROD_CLUSTERS created a table prod_schema.employees with LOCATION "s3://somebucket/employees", then in ADHOC_CLUSTER I would run (via a Databricks notebook):
CREATE VIEW live_views_prod_schema.employees as SELECT * FROM delta.`s3://somebucket/employees`
What I tried
My thinking was to have a script in ADHOC_CLUSTER, run a SELECT against the HMS of the PROD_CLUSTERS (meaning JDBC to the MySQL HMS, to query the relational DB directly), get all tables names and s3 locations, and programmatically issue all the necessary CREATE VIEW statements in ADHOC_CLUSTER.
I imagined this would be simple as in HMS the SDS table would hold the location (s3://somebucket/employees) in column SDS.location, for every delta table created by PROD_CLUSTERS (and some Json based tables). The problem is: in SDS I only find the correct location for some of the tables.
For other tables, the s3 path listed in HMS's SDS.location is pointing to the default bucket of the schema (the table was created on a different bucket), to a path that goes like
s3a://<default-bucket>/prod_schema.db/emplyees-__PLACEHOLDER__
Couldn't get past this, nor find the actual data files location somewhere else in the HMS.
My question
How can one query HMS to get the full path for all data files of tables defined in that HMS?
Thanks!
08-24-2023 11:48 PM
Hi @Mo ,
Sorry for the gap, just back from vacation...
So yes, what you describe is possible, and quite close to what I'm doing now.
The thing is, assuming this is not done manually, this requires:
This is somewhat cumbersome, especially as there is more there is more than one prod_cluster. I simplified the design for the sake of clarity, but in prod clusters are added from time to time, all using the same HMS. Thus, by the setup listed above, I need to create those additional scripts and schedule them on each new prod cluster.
This is doable, but involves setting up, and maintaining, processes on quite a few places.
would be way simpler if I just had a single process running on adhoc_cluster, querying the HMS and creating the views.
Thanks for your suggestions!
08-10-2023 08:56 AM - edited 08-10-2023 09:01 AM
thanks @Mo! if you refer to https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-aux-describe-table.html then yes, am aware, but this will only work if I run it on the PROD_CLUSTERS, those that have the HMS configured as their own HMS.
It will not work as a query against the MySQL HMS, as I do from the ADHOC_CLUSTER - this cluster has a different HMS, table_name is not defined there.
08-10-2023 08:58 AM - edited 08-10-2023 09:02 AM
to clarify: I JDBC to the MySQL HMS to query the HMS tables directly, from an external source. clarified in post.
08-11-2023 01:10 AM
08-11-2023 01:15 AM
sorry the post was cut halfway through
08-24-2023 11:48 PM
Hi @Mo ,
Sorry for the gap, just back from vacation...
So yes, what you describe is possible, and quite close to what I'm doing now.
The thing is, assuming this is not done manually, this requires:
This is somewhat cumbersome, especially as there is more there is more than one prod_cluster. I simplified the design for the sake of clarity, but in prod clusters are added from time to time, all using the same HMS. Thus, by the setup listed above, I need to create those additional scripts and schedule them on each new prod cluster.
This is doable, but involves setting up, and maintaining, processes on quite a few places.
would be way simpler if I just had a single process running on adhoc_cluster, querying the HMS and creating the views.
Thanks for your suggestions!
08-11-2023 01:21 AM
df_sds = (spark.read
.format("jdbc")
.option("driver", driver)
.option("url", "jdbc:mysql://..../prod_hms")
.option("dbtable", "(select SD_ID, CD_ID, location from SDS) a")
.option("user", "prod_hms_mysql_user")
.option("password", password)
.load()
)
08-11-2023 01:22 AM - edited 08-11-2023 01:23 AM
09-01-2023 01:11 AM
Hi @Nino , To query HMS to get the full path for all data files of tables defined in that HMS, you can use the Hive MetaStore API. Specifically, you can use the GET_TABLE_FILES
operation to retrieve the file metadata for a given table, including the file path.
09-01-2023 03:26 AM
Hi @Kaniz_Fatma , do you mean this ? couldn't see there's a GET_TABLE_FILES, not sure how to make it work, if you have a working example or some pointer it would be great.
Thanks!
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group