Accessing data from a legacy hive metastore workspace on a new Unity Catalog workspace
03-28-2024 09:01 AM
Hello,
For the purposes of testing, I'm interested in creating a new workspace with Unity Catalog enabled, and from there I'd like to access (external - S3) tables on an existing legacy Hive metastore workspace (not UC enabled). The goal is for both workspaces to point to the same underlying S3 external location.
As a requirement, I do not want to duplicate data, and ideally updates to data on the legacy workspace would be reflected in tables surfaced through UC.
I was considering the possibility of shallow cloning; however, from my understanding that is not possible across UC and the Hive metastore.
Does anybody have experience/recommendations on doing this? Looking through the Databricks documentation, I'm mostly finding information on upgrading a legacy workspace only.
#unitycatalog #hivemetastore
03-28-2024 01:54 PM
@Retired_mod From looking at the documentation, none of it addresses my particular use case as illustrated (2 workspaces in one account, 1 with UC and the other without). Was there a particular part of any of the docs you're suggesting that can help here?
03-31-2024 01:52 AM
Your aim is to access external S3 tables from a Unity Catalog workspace without duplicating data, while keeping updates synchronized.
First, configure external location permissions. Ensure that both your Unity Catalog and Hive metastore workspaces have read permissions for the S3 location containing your tables. This allows both workspaces to access the same underlying data without duplication.
Then create external tables in Unity Catalog, specifying the S3 location of the existing table (CREATE TABLE ... LOCATION ...). This creates pointers to the existing data in S3 without copying it.
Remember, whether in Hive or otherwise, external tables are just pointers, often used for ETL by overwriting the existing data (in your case on S3). Both Hive and Unity Catalog control the schema and point to the data location, but neither controls the data itself. You can then access the data from both the Hive metastore and Unity Catalog.
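As a rough sketch of those two steps in Databricks SQL, run from the UC-enabled workspace (the storage credential, external location, bucket, and catalog/schema/table names below are placeholders for illustration):

```sql
-- Register the existing S3 path as a Unity Catalog external location
-- (assumes a storage credential named my_s3_cred has already been set up).
CREATE EXTERNAL LOCATION IF NOT EXISTS legacy_warehouse
  URL 's3://my-legacy-bucket/warehouse'
  WITH (STORAGE CREDENTIAL my_s3_cred);

-- Grant read access on the location to the relevant principals.
GRANT READ FILES ON EXTERNAL LOCATION legacy_warehouse TO `account users`;

-- Point a UC external table at the existing Delta data. The schema is
-- read from the Delta transaction log, so nothing is copied or duplicated.
CREATE TABLE IF NOT EXISTS main.legacy.orders
  USING DELTA
  LOCATION 's3://my-legacy-bucket/warehouse/orders';
```

Because the Delta transaction log lives alongside the data in S3, writes made from the legacy HMS workspace should surface in the UC table automatically, with no sync job needed.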
HTH
London
United Kingdom
view my LinkedIn profile
https://en.everybodywiki.com/Mich_Talebzadeh
Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun).