<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Where in Hive Metastore can the s3 locations of Databricks tables be found? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/where-in-hive-metastore-can-the-s3-locations-of-databricks/m-p/39586#M27025</link>
    <description>&lt;P&gt;sorry the post was cut halfway through&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Fri, 11 Aug 2023 08:15:20 GMT</pubDate>
    <dc:creator>Nino</dc:creator>
    <dc:date>2023-08-11T08:15:20Z</dc:date>
    <item>
      <title>Where in Hive Metastore can the s3 locations of Databricks tables be found?</title>
      <link>https://community.databricks.com/t5/data-engineering/where-in-hive-metastore-can-the-s3-locations-of-databricks/m-p/39493#M26995</link>
      <description>&lt;P&gt;I have a few Databricks clusters, some share a single Hive Metastore (HMS),&amp;nbsp;call&amp;nbsp;them PROD_CLUSTERS, and an additional&amp;nbsp;cluster, ADHOC_CLUSTER, which has its own HMS.&amp;nbsp;&lt;/P&gt;&lt;P&gt;All my data is stored in S3, as Databricks delta tables: PROD_CLUSTERS have read-write on those s3 buckets, and ADHOC_CLUSTER&amp;nbsp;has only read privileges.&amp;nbsp;&lt;/P&gt;&lt;P&gt;I need to set up ADHOC_CLUSTER&amp;nbsp;so that it has a set of views on top of all the tables (as s3 paths) defined by the PROD_CLUSTERS.&amp;nbsp;&lt;/P&gt;&lt;P&gt;For example, if one of the PROD_CLUSTERS created a table&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;prod_schema.employees&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;with&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;LOCATION "s3://somebucket/employees", then in ADHOC_CLUSTER I would run (via a Databricks notebook):&lt;/P&gt;&lt;PRE&gt;CREATE VIEW live_views_prod_schema.employees&amp;nbsp;
as 
SELECT * FROM delta.`s3://somebucket/employees`&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;What I tried&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;My thinking was to have a script in ADHOC_CLUSTER, run a SELECT against the HMS of the PROD_CLUSTERS&amp;nbsp;&lt;SPAN&gt;(meaning JDBC to the MySQL HMS, to query the relational DB directly)&lt;/SPAN&gt;, get all table names and s3 locations, and programmatically&amp;nbsp;issue all the&amp;nbsp;necessary CREATE VIEW statements in ADHOC_CLUSTER.&amp;nbsp;&lt;/P&gt;&lt;P&gt;I imagined this would be simple, as in HMS the SDS table would hold the location (s3://somebucket/employees) in column&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;SDS.location, for&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;every&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;delta table created by PROD_CLUSTERS (and some JSON-based tables). The problem is:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;STRONG&gt;in SDS I only find the correct location for some of the tables.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;For other tables, the s3 path listed in HMS's&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;SDS.location&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;points to the default bucket of the schema (even though the table was created in a different bucket), to a path that goes like&lt;/P&gt;&lt;PRE&gt;s3a://&amp;lt;default-bucket&amp;gt;/prod_schema.db/employees-__PLACEHOLDER__

&lt;/PRE&gt;&lt;P&gt;I couldn't get past this, nor could I find the actual data file locations anywhere else in the HMS.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;My question&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;How can one query HMS to get the full path for all data files of tables defined in that HMS?&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Thu, 10 Aug 2023 16:02:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/where-in-hive-metastore-can-the-s3-locations-of-databricks/m-p/39493#M26995</guid>
      <dc:creator>Nino</dc:creator>
      <dc:date>2023-08-10T16:02:12Z</dc:date>
    </item>
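The programmatic view-generation step described in the post above can be sketched as follows. This is a minimal sketch: the helper name is my own, the `CREATE OR REPLACE` variant (rather than the post's plain `CREATE VIEW`) is used so reruns are idempotent, and the (table, location) pairs would in practice come from the prod metastore; `spark.sql` would only run inside a Databricks notebook.

```python
# Sketch of the view-generation step, run from a notebook on ADHOC_CLUSTER.
# The table list below is the post's example; real pairs would be read from
# the prod HMS.

def build_view_statement(schema: str, table: str, s3_location: str) -> str:
    """Render a CREATE VIEW over a Delta table addressed by its S3 path."""
    # CREATE OR REPLACE (rather than plain CREATE) so the job can rerun safely.
    return (
        f"CREATE OR REPLACE VIEW {schema}.{table} "
        f"AS SELECT * FROM delta.`{s3_location}`"
    )

# Example pair, as in the post.
tables = [
    ("employees", "s3://somebucket/employees"),
]

for name, location in tables:
    stmt = build_view_statement("live_views_prod_schema", name, location)
    print(stmt)
    # spark.sql(stmt)  # uncomment inside a Databricks notebook
```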
    <item>
      <title>Re: Where in Hive Metastore can the s3 locations of Databricks tables be found?</title>
      <link>https://community.databricks.com/t5/data-engineering/where-in-hive-metastore-can-the-s3-locations-of-databricks/m-p/39528#M27005</link>
      <description>&lt;P&gt;Thanks&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/28727"&gt;@Mo&lt;/a&gt;!&amp;nbsp;If you refer to&amp;nbsp;&lt;A href="https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-aux-describe-table.html" target="_blank" rel="noopener"&gt;https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-aux-describe-table.html&lt;/A&gt;&amp;nbsp;then yes, I am aware, but this will only work if I run it on the&amp;nbsp;&lt;SPAN&gt;PROD_CLUSTERS - the clusters&lt;/SPAN&gt;&amp;nbsp;that have that HMS configured as their own.&amp;nbsp;&lt;/P&gt;&lt;P&gt;It will not work as a query against the MySQL HMS, as I do from the&amp;nbsp;&lt;SPAN&gt;ADHOC_CLUSTER - this cluster has a different HMS, and&amp;nbsp;table_name is not defined there.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 10 Aug 2023 16:01:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/where-in-hive-metastore-can-the-s3-locations-of-databricks/m-p/39528#M27005</guid>
      <dc:creator>Nino</dc:creator>
      <dc:date>2023-08-10T16:01:15Z</dc:date>
    </item>
    <item>
      <title>Re: Where in Hive Metastore can the s3 locations of Databricks tables be found?</title>
      <link>https://community.databricks.com/t5/data-engineering/where-in-hive-metastore-can-the-s3-locations-of-databricks/m-p/39529#M27006</link>
      <description>&lt;P&gt;To clarify: I use JDBC to connect to the MySQL HMS and query the HMS tables directly, from an external source. I've clarified this in the post.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 10 Aug 2023 16:02:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/where-in-hive-metastore-can-the-s3-locations-of-databricks/m-p/39529#M27006</guid>
      <dc:creator>Nino</dc:creator>
      <dc:date>2023-08-10T16:02:55Z</dc:date>
    </item>
    <item>
      <title>Re: Where in Hive Metastore can the s3 locations of Databricks tables be found?</title>
      <link>https://community.databricks.com/t5/data-engineering/where-in-hive-metastore-can-the-s3-locations-of-databricks/m-p/39583#M27023</link>
      <description>&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/28727"&gt;@Mo&lt;/a&gt;, thanks for the input!&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;Yes, points 1 &amp;amp; 2 are correct &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;Generally speaking, the solution you suggest is something like:&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;OL&gt;&lt;LI&gt;&lt;SPAN&gt;Run&amp;nbsp;"describe&amp;nbsp;detail&amp;nbsp;table_name"&amp;nbsp;on one of the clusters from the PROD_CLUSTERS group, for all tables, generating a list of table_name &amp;amp; table_s3_location&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Copy/move this list to the ADHOC_CLUSTER, and iterate over the list to create the view for each table&lt;/SPAN&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;DIV&gt;&lt;SPAN&gt;Actually, as a workaround, I already have a similar solution: a job in one of the PROD_CLUSTERS generates that list, and a second job in ADHOC_CLUSTER iterates over it to create the views.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;Both jobs are scheduled over the day, so as new tables are created in PROD, new views are generated in ADHOC.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;This works fine but is somewhat cumbersome (it always requires 2 jobs for each prod &amp;amp; adhoc pair, some syncing, etc.).&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;My intention was to simplify this architecture.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;The thing is, in step 1 above, the "describe" statement has to run on one of the PROD_CLUSTERS.&amp;nbsp;That's what forces me to use 2 jobs: one listing the tables' definitions (running in prod), and one reading that list and generating the views (in ad-hoc).&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;The&amp;nbsp;"describe&amp;nbsp;detail&amp;nbsp;table_name"&amp;nbsp;statement cannot run in&amp;nbsp;ADHOC_CLUSTER - this cluster uses a different HMS, where table_name is not defined.&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;To make it clear, there are 2 different Hive Metastores:&lt;/SPAN&gt;&lt;/DIV&gt;&lt;OL&gt;&lt;LI&gt;&lt;SPAN&gt;PROD_CLUSTERS - prod_HMS&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;ADHOC_CLUSTER - adhoc_HMS&lt;/SPAN&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;DIV&gt;&lt;SPAN&gt;Hence my attempt to JDBC to prod_HMS, directly to the MySQL instance, from a notebook running in&amp;nbsp;ADHOC_CLUSTER, and run a SELECT that will "imitate" the functionality of "describe&amp;nbsp;detail&amp;nbsp;table_name".&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Fri, 11 Aug 2023 08:10:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/where-in-hive-metastore-can-the-s3-locations-of-databricks/m-p/39583#M27023</guid>
      <dc:creator>Nino</dc:creator>
      <dc:date>2023-08-11T08:10:59Z</dc:date>
    </item>
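The two-job workaround in the reply above (step 1: DESCRIBE DETAIL on prod; step 2: view creation on ad-hoc) can be sketched roughly as follows. This is a hedged outline, not the poster's actual job: the function name is illustrative, and it assumes a live SparkSession on one of the PROD_CLUSTERS.

```python
# Sketch of step 1, run on one of the PROD_CLUSTERS: collect a
# (table_name, s3_location) pair for every table in a schema via
# DESCRIBE DETAIL. All names here are illustrative.

def collect_table_locations(spark, schema):
    """Return [(table, location)] for all tables in `schema`."""
    pairs = []
    for row in spark.sql(f"SHOW TABLES IN {schema}").collect():
        # DESCRIBE DETAIL on a Delta table returns a single row whose
        # `location` field is the table's actual storage path.
        detail = spark.sql(f"DESCRIBE DETAIL {schema}.{row.tableName}").collect()[0]
        pairs.append((row.tableName, detail.location))
    return pairs

# The resulting list is then shipped to ADHOC_CLUSTER (step 2), where a
# second job turns each pair into a CREATE VIEW over delta.`<location>`.
```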
    <item>
      <title>Re: Where in Hive Metastore can the s3 locations of Databricks tables be found?</title>
      <link>https://community.databricks.com/t5/data-engineering/where-in-hive-metastore-can-the-s3-locations-of-databricks/m-p/39586#M27025</link>
      <description>&lt;P&gt;sorry the post was cut halfway through&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 11 Aug 2023 08:15:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/where-in-hive-metastore-can-the-s3-locations-of-databricks/m-p/39586#M27025</guid>
      <dc:creator>Nino</dc:creator>
      <dc:date>2023-08-11T08:15:20Z</dc:date>
    </item>
    <item>
      <title>Re: Where in Hive Metastore can the s3 locations of Databricks tables be found?</title>
      <link>https://community.databricks.com/t5/data-engineering/where-in-hive-metastore-can-the-s3-locations-of-databricks/m-p/39588#M27026</link>
      <description>&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Here's an example query I run from ADHOC_CLUSTER:&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;LI-CODE lang="python"&gt;df_sds = (spark.read
.format("jdbc")
.option("driver", driver)
.option("url", "jdbc:mysql://..../prod_hms")
.option("dbtable", "(select SD_ID, CD_ID, location from SDS) a")
.option("user", "prod_hms_mysql_user")
.option("password", password)
.load()
)&lt;/LI-CODE&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;I expected "location" to be the s3 path, but it's not always so (elaborated in the original posting).&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Thanks!&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Fri, 11 Aug 2023 08:21:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/where-in-hive-metastore-can-the-s3-locations-of-databricks/m-p/39588#M27026</guid>
      <dc:creator>Nino</dc:creator>
      <dc:date>2023-08-11T08:21:11Z</dc:date>
    </item>
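To get table names alongside locations in a single JDBC read (rather than querying SDS alone, as in the post), the standard Hive metastore layout lets you join TBLS to DBS and SDS. A sketch, assuming the stock Hive 2.x/3.x metastore schema - column names may differ across HMS versions, and, as described in the original posting, the SDS location for some Delta tables can still be the `__PLACEHOLDER__` path rather than the real one:

```python
# Builds a subquery string for the JDBC `dbtable` option, joining the Hive
# metastore's DBS (schemas), TBLS (tables), and SDS (storage descriptors)
# tables. Verify the column names against your HMS version before relying
# on this.

def hms_location_query():
    """One row per table: (schema_name, table_name, location)."""
    return (
        "(SELECT d.NAME AS schema_name, "
        "t.TBL_NAME AS table_name, "
        "s.LOCATION AS location "
        "FROM TBLS t "
        "JOIN DBS d ON t.DB_ID = d.DB_ID "
        "JOIN SDS s ON t.SD_ID = s.SD_ID) q"
    )

# Plugged into the same JDBC read as in the post (placeholders as before):
# df = (spark.read.format("jdbc")
#       .option("driver", driver)
#       .option("url", "jdbc:mysql://..../prod_hms")
#       .option("dbtable", hms_location_query())
#       .option("user", "prod_hms_mysql_user")
#       .option("password", password)
#       .load())
```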
    <item>
      <title>Re: Where in Hive Metastore can the s3 locations of Databricks tables be found?</title>
      <link>https://community.databricks.com/t5/data-engineering/where-in-hive-metastore-can-the-s3-locations-of-databricks/m-p/39590#M27027</link>
      <description>&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Something went wrong there; here's the last sentence: I expected "location" to be the s3 path, but it's not always so (elaborated in the original posting).&amp;nbsp;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;Thanks!&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Fri, 11 Aug 2023 08:23:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/where-in-hive-metastore-can-the-s3-locations-of-databricks/m-p/39590#M27027</guid>
      <dc:creator>Nino</dc:creator>
      <dc:date>2023-08-11T08:23:22Z</dc:date>
    </item>
    <item>
      <title>Re: Where in Hive Metastore can the s3 locations of Databricks tables be found?</title>
      <link>https://community.databricks.com/t5/data-engineering/where-in-hive-metastore-can-the-s3-locations-of-databricks/m-p/41443#M27356</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/28727"&gt;@Mo&lt;/a&gt;,&amp;nbsp;&lt;/P&gt;&lt;P&gt;Sorry for the gap, just back from vacation...&lt;/P&gt;&lt;P&gt;So yes, what you describe is possible, and quite close to what I'm doing now.&amp;nbsp;&lt;/P&gt;&lt;P&gt;The thing is, assuming this is not done manually, this requires:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Setting up a process running all those&amp;nbsp;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;EM&gt;&lt;STRONG&gt;describe&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;FONT color="#FF0000"&gt;detail&lt;/FONT&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;table_name&lt;/STRONG&gt;&lt;/EM&gt;&lt;SPAN&gt;" on my prod_cluster&lt;/SPAN&gt;&amp;nbsp;&lt;/LI&gt;&lt;LI&gt;Transferring all the outputs to the adhoc_cluster&lt;/LI&gt;&lt;LI&gt;Running the CREATE VIEW scripts on adhoc_cluster&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;This is somewhat cumbersome, especially as there is more than one&amp;nbsp;&lt;SPAN&gt;prod_cluster. I simplified the design for the sake of clarity, but in prod, clusters are added from time to time, all using the same HMS. Thus, with the setup listed above, I need to create those additional scripts and schedule them on each new prod cluster.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;This is doable, but involves setting up, and maintaining, processes in quite a few places. &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;It would be much simpler if I just had a single process running on adhoc_cluster, querying the HMS and creating the views.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thanks for your suggestions!&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 25 Aug 2023 06:48:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/where-in-hive-metastore-can-the-s3-locations-of-databricks/m-p/41443#M27356</guid>
      <dc:creator>Nino</dc:creator>
      <dc:date>2023-08-25T06:48:55Z</dc:date>
    </item>
    <item>
      <title>Re: Where in Hive Metastore can the s3 locations of Databricks tables be found?</title>
      <link>https://community.databricks.com/t5/data-engineering/where-in-hive-metastore-can-the-s3-locations-of-databricks/m-p/43061#M27479</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;, do you mean &lt;A href="https://hive-metastore-client.readthedocs.io/en/latest/getstarted.html#available-methods" target="_self"&gt;this&lt;/A&gt;? I couldn't see a &lt;SPAN&gt;GET_TABLE_FILES method there, and I'm not sure how to make it work; if you have a working example or some pointer, that would be great.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thanks!&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 01 Sep 2023 10:26:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/where-in-hive-metastore-can-the-s3-locations-of-databricks/m-p/43061#M27479</guid>
      <dc:creator>Nino</dc:creator>
      <dc:date>2023-09-01T10:26:58Z</dc:date>
    </item>
  </channel>
</rss>

