- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
08-09-2023 10:34 PM - edited 08-10-2023 09:02 AM
I have a few Databricks clusters, some share a single Hive Metastore (HMS), call them PROD_CLUSTERS, and an additional cluster, ADHOC_CLUSTER, which has its own HMS.
All my data is stored in S3, as Databricks delta tables: PROD_CLUSTERS have read-write on those s3 buckets, and ADHOC_CLUSTER has only read privileges.
I need to set up ADHOC_CLUSTER so that it has a set of views on top of all the tables (as s3 paths) defined by the PROD_CLUSTERS.
For example, if one of the PROD_CLUSTERS created a table prod_schema.employees with LOCATION "s3://somebucket/employees", then in ADHOC_CLUSTER I would run (via a Databricks notebook):
CREATE VIEW live_views_prod_schema.employees as SELECT * FROM delta.`s3://somebucket/employees`
What I tried
My thinking was to have a script in ADHOC_CLUSTER, run a SELECT against the HMS of the PROD_CLUSTERS (meaning JDBC to the MySQL HMS, to query the relational DB directly), get all tables names and s3 locations, and programmatically issue all the necessary CREATE VIEW statements in ADHOC_CLUSTER.
I imagined this would be simple as in HMS the SDS table would hold the location (s3://somebucket/employees) in column SDS.location, for every delta table created by PROD_CLUSTERS (and some Json based tables). The problem is: in SDS I only find the correct location for some of the tables.
For other tables, the s3 path listed in HMS's SDS.location is pointing to the default bucket of the schema (the table was created on a different bucket), to a path that goes like
s3a://<default-bucket>/prod_schema.db/emplyees-__PLACEHOLDER__
Couldn't get past this, nor find the actual data files location somewhere else in the HMS.
My question
How can one query HMS to get the full path for all data files of tables defined in that HMS?
Thanks!