yesterday
Hi,
we have an Azure Databricks Workspace that uses Unity Catalog for storing data. We use a separate storage account to store the catalogs. We need to enable the option "Infrastructure encryption" on this storage account, and unfortunately this is only possible during creation of a storage account.
Our plan is:
I did notice that the modification date of the files on the storage account changes to the current date after the copy activity. Is this a problem for Databricks, or is this file date not used because the lineage is stored in Databricks tables?
I'm an infra person and not a data engineer.
Regards Marco
yesterday
Hi Marco
Are you using managed or external tables?
This should be easy to see either in Unity Catalog, or by checking in your storage account whether the folders are ID based or have actual table names (e.g. dim_customer or something).
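If it helps, here is a minimal sketch (run from a Databricks notebook; the catalog name is a placeholder) that lists the table type per table via Unity Catalog's information_schema, so you can see which ones are MANAGED and which are EXTERNAL:

# Minimal sketch: list table types via Unity Catalog's information_schema.
# 'my_catalog' is a placeholder for your actual catalog name.
tables = spark.sql("""
    SELECT table_catalog, table_schema, table_name, table_type
    FROM my_catalog.information_schema.tables
    ORDER BY table_schema, table_name
""")
display(tables)  # table_type shows MANAGED or EXTERNAL for each table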
Either way I would move the data to the temporary storage layer using a deep clone. This ensures you keep all the internal references such as the Delta log in Unity Catalog; those will break if you simply copy the files in storage directly.
Since the solution is not that big, the cost won't be significant.
https://docs.databricks.com/aws/en/delta/clone
The main difference is the cleanup: for managed tables you can simply drop the tables/schemas in UC, while for external tables you also need to go and physically delete the files in the storage account.
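To make the deep clone part concrete, a rough sketch could look like this (the catalog, schema and table names are placeholders, not your actual ones):

# Rough sketch: deep clone the managed tables of one schema into a temporary catalog.
# 'main_catalog', 'temp_catalog' and 'sales' are placeholder names.
table_names = [
    r.table_name
    for r in spark.sql("""
        SELECT table_name
        FROM main_catalog.information_schema.tables
        WHERE table_schema = 'sales' AND table_type = 'MANAGED'
    """).collect()
]

for name in table_names:
    # DEEP CLONE copies the data files and metadata of the source table to the target.
    spark.sql(f"""
        CREATE OR REPLACE TABLE temp_catalog.sales.{name}
        DEEP CLONE main_catalog.sales.{name}
    """)

Once the new storage account is in place you would clone the tables back the same way, and then do the cleanup described above.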
yesterday - last edited yesterday
Hi Marco,
Haha, I actually went through a similar case 🙂
If your UC storage is Databricks managed, it won't work the way you think. Copying blobs or containers will be a waste of time as far as I know; or put the other way, Databricks won't recognize the data inside automatically. Databricks-managed Unity Catalog storage has "directories" like:
<container>/
└── __unitystorage/
    └── catalogs/
        └── <catalog-uuid>/
            └── schemas/
                └── <schema-uuid>/
                    └── tables/
                        └── <table-uuid>/
                            ├── _delta_log/
                            └── *.parquet

Catalogs and schemas are actually UUID based, and this is the problematic part. So when you create an external location pointing to the UC root container, it won't recognize it; it will skip it and basically look like it's empty...
This means that you will need a pipeline that loads this data, using WASBS or a UC-enabled external location, back into a "fresh" managed catalog.
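As an illustration only (the storage account, container and target names below are made-up placeholders, and the path must be covered by an external location or storage credential), such a reload could deep clone the old Delta folders by path into the new managed catalog:

# Rough sketch: re-register one table from the copied container into a fresh managed catalog.
# The abfss path and the target catalog/schema/table names are placeholders.
old_path = (
    "abfss://old-container@oldstorageaccount.dfs.core.windows.net/"
    "__unitystorage/catalogs/<catalog-uuid>/schemas/<schema-uuid>/tables/<table-uuid>"
)

# Deep clone straight from the Delta folder path into a new managed table.
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS new_catalog.new_schema.my_table
    DEEP CLONE delta.`{old_path}`
""")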
I was actually working on a tool to solve this specific issue, to copy this data nicely... If you'd like, I'm happy to share it with you via GitHub, for example. I tested it locally, but I would be more than happy to have someone test it out in practice. It's deep clone based too.
If the catalog is managed outside of Databricks, then it's a bit easier I'd say.
Let me know if you have any questions or need support.
10 hours ago
Thanks for the replies 🙂
We are using both managed and external tables.
Our environment is built up with 3 storage accounts:
Is the plan below one that should work?
Regards,
Marco
10 hours ago
@Marco37 Yeah, that sounds like the plan I would follow.
... and then of course remember to pause any jobs/workflows and other things running on top before you get going - and restart them once you are done!
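If you want to script that step, a rough sketch using the Jobs 2.1 REST API could look like this (the host, token and job ID are placeholders; the same call with "UNPAUSED" restores the schedule afterwards):

import requests

# Rough sketch: pause a scheduled job before the migration, restore it afterwards.
# Host, token and job ID are placeholders for your own values.
DATABRICKS_HOST = "https://adb-<workspace-id>.azuredatabricks.net"
TOKEN = "<personal-access-token>"
JOB_ID = 123
headers = {"Authorization": f"Bearer {TOKEN}"}

# Fetch the current settings, flip the schedule to PAUSED and write it back.
job = requests.get(
    f"{DATABRICKS_HOST}/api/2.1/jobs/get",
    headers=headers,
    params={"job_id": JOB_ID},
).json()
schedule = job["settings"].get("schedule")
if schedule:
    schedule["pause_status"] = "PAUSED"  # use "UNPAUSED" once the migration is done
    requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/update",
        headers=headers,
        json={"job_id": JOB_ID, "new_settings": {"schedule": schedule}},
    ).raise_for_status()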
9 hours ago
@Marco37 Yes, overall that sounds good.
Just remember some points:
I'm happy to discuss more details; actually, I'm really interested in migration topics like this 🙂
I guess each container you show represents a catalog?