12-04-2022 02:58 PM
Hi All,
Can anyone point me to documentation or a personally tried-and-tested method for backing up (and restoring) Unity Catalog and its associated managed tables? We're running on Azure and using ADLS Gen2.
Regards,
Ashley
12-05-2022 10:51 AM
@Ashley Betts May I know the need for backup and restore? Usually, as a best practice, table data is stored in an external location as external tables, not managed tables. If you store data as external tables, a regular copy or backup mechanism should work. Let's wait and see if any of our community members have more input.
12-06-2022 02:39 PM
Hi @karthik p ,
I have to disagree.
Managed tables are the default way to create tables in Unity Catalog. These tables are stored in the Unity Catalog root storage location that you configured when you created a metastore. Databricks recommends using managed tables whenever possible to ensure support of Unity Catalog features. All managed tables use Delta Lake.
source: https://docs.databricks.com/data-governance/unity-catalog/best-practices.html#organize-your-data
@Ashley Betts if you see this:
Each metastore is configured with a root storage location, which is used for managed tables. You need to ensure that no users have direct access to this storage location. Giving access to the storage location could allow a user to bypass access controls in a Unity Catalog metastore and disrupt auditability. For these reasons, you should not reuse a bucket that is your current DBFS root file system or has previously been a DBFS root file system for the root storage location in your Unity Catalog metastore.
I don't have any resources on UC backup. But if you read the above, you can see that Unity Catalog / metastore managed tables are stored in the metastore root bucket.
You should create an IAM role (or the Azure equivalent) that Databricks uses to access that storage, so basically there shouldn't be any other mechanism to read/write the data (outside of Databricks), to make sure the data won't get corrupted and no one can bypass the access controls set in Unity Catalog. When you use Delta tables, you can use time travel to restore a previous version of a table.
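As a rough illustration (in a Databricks notebook, where spark is predefined; the table name is a placeholder), a time-travel restore looks like this:

```python
# Inspect the table history to pick a version to roll back to.
spark.sql("DESCRIBE HISTORY main.sales.customers_scd2").show(truncate=False)

# Restore the table to an earlier version (TIMESTAMP AS OF also works).
spark.sql("RESTORE TABLE main.sales.customers_scd2 TO VERSION AS OF 42")
```

Keep in mind that time travel only reaches back as far as the table's history retention (VACUUM) allows, so it covers bad writes but not a dropped table or total storage loss.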
Backup seems tricky, as managed tables are no longer stored at locations corresponding to their names; instead they get some sort of UUID path, and I think the mapping of table name to location is stored in the Databricks control plane (database/backend).
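That said, you can still look up where a given managed table physically lives (a sketch; placeholder table name):

```python
# Find the physical (UUID-style) storage path of a managed table.
detail = spark.sql("DESCRIBE DETAIL main.sales.customers_scd2").collect()[0]
print(detail.location)
# e.g. abfss://<container>@<account>.dfs.core.windows.net/<metastore-id>/tables/<uuid>
```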
I have always liked external tables, but with UC I am leaning more towards managed tables.
thanks,
Pat.
12-06-2022 04:26 PM
@Pat Sienkiewicz you are right, I went a bit off track I think (the external-table storage recommendation applies without UC). For Unity Catalog, managed tables in the metastore root storage are recommended. Thank you for the post above. @Ashley Betts Pat's response above will give you more information.
@Pat Sienkiewicz but I have one question in terms of backup: I remember that for one of our Databricks E2 migrations we moved managed table data, which lives under /user/hive/warehouse. If that is possible, a UC metastore managed table migration should also be possible.
12-06-2022 10:25 PM
@karthik p migration is a different story, I would say.
Just a few ideas: to migrate data to a different metastore / UC, one could use Delta Sharing and then transfer the data. You have many options here, for example deep clone, insert into a new table, etc.
Another option is to set up external locations in your current workspace/UC and export the data to external tables.
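For example, an export via deep clone could look roughly like this (a sketch; the table names and abfss:// path are placeholders, and the path must already be registered as a UC external location):

```python
# Copy a managed table out to an external location via DEEP CLONE.
spark.sql("""
    CREATE OR REPLACE TABLE main.sales.customers_scd2_backup
    DEEP CLONE main.sales.customers_scd2
    LOCATION 'abfss://backup@mybackupacct.dfs.core.windows.net/sales/customers_scd2'
""")
```

Deep clone copies both data and metadata, and re-running the same statement syncs later changes incrementally.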
thanks,
Pat.
12-07-2022 02:31 PM
Thanks fellas, the main driver for backup/restore is risk management: to have a procedure in place following complete failure or malicious acts. We have SCD2 tables in Databricks which are the only source of historic data. While I realise I have redundancy at the storage layer to cover partial failures, this doesn't cover total failure or malicious (or even accidental) acts.
12-24-2022 07:39 PM
@Ashley Betts Let me know if you found a way to back up/restore the UC metastore. It's a valuable feature because, unlike an external hive_metastore where I could go and see the metadata in the SQL Server tables (my external Hive metastore ran on SQL Server), the UC metadata is not accessible to me (it's stored in the Databricks control plane). The metadata I am referring to here is the schema, table, and column names, and the ADLS folder path where the table data is stored for external tables.
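One way I could at least export that metadata myself is via the UC information_schema views, e.g. (a rough sketch; the catalog name and output paths are placeholders):

```python
# Dump UC metadata (tables and columns) out to a backup location.
tables = spark.sql("""
    SELECT table_catalog, table_schema, table_name, table_type
    FROM main.information_schema.tables
""")
columns = spark.sql("""
    SELECT table_schema, table_name, column_name, data_type
    FROM main.information_schema.columns
""")
tables.write.mode("overwrite").json("abfss://backup@acct.dfs.core.windows.net/uc_meta/tables")
columns.write.mode("overwrite").json("abfss://backup@acct.dfs.core.windows.net/uc_meta/columns")
# Per-table storage locations can be fetched with DESCRIBE DETAIL <table>.
```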
@Pat Sienkiewicz when creating a UC catalog or schema, I provided an ADLS folder path (on a different storage account from the UC metastore) and created external Delta tables, because I want the storage cost to go to the business team. But the upside of managed tables is auto-optimize and space cleanup when a table is dropped. Though Unity manages the SPN's permission on the ADLS folder where the table data lives, that permission only applies when a user accesses the folder through a Databricks workspace; it doesn't apply when a user runs a Python script with that same SPN against the ADLS folder outside the workspace, say from an Azure Function app.
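For reference, that setup looks roughly like this (a sketch; names and paths are placeholders, and the abfss:// paths must be registered as UC external locations first):

```python
# A schema with its own ADLS storage, plus an external Delta table.
spark.sql("""
    CREATE SCHEMA IF NOT EXISTS main.biz_team
    MANAGED LOCATION 'abfss://bizdata@bizacct.dfs.core.windows.net/uc/biz_team'
""")
spark.sql("""
    CREATE TABLE IF NOT EXISTS main.biz_team.orders (id BIGINT, amount DOUBLE)
    USING DELTA
    LOCATION 'abfss://bizdata@bizacct.dfs.core.windows.net/tables/orders'
""")
```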
01-26-2024 08:18 AM
Our UC managed tables are stored on a prod ADLS storage account, which is different from the UC root storage account. So what's the best way to back up and restore UC managed tables into a different region? One option is to deep clone the tables, copy the ADLS folders to another region, and then redefine the tables on top of them in a metastore in that region. But is there a better way?
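For example, a bulk version of the deep-clone option might look like this (a sketch only; main_backup is an illustrative catalog name, assumed to be already bound to storage in the target region):

```python
# Deep clone every managed table in a catalog into a backup catalog
# whose storage lives in the target region.
rows = spark.sql("""
    SELECT table_schema, table_name
    FROM main.information_schema.tables
    WHERE table_type = 'MANAGED'
""").collect()

for r in rows:
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS main_backup.{r.table_schema}")
    spark.sql(
        f"CREATE OR REPLACE TABLE main_backup.{r.table_schema}.{r.table_name} "
        f"DEEP CLONE main.{r.table_schema}.{r.table_name}"
    )
```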
02-13-2024 03:19 PM
@prasad_vaze Have you got any direction on this? I am in the same boat, looking for an approach to back up and restore the tables.