yesterday
Hi,
we have an Azure Databricks Workspace that uses Unity Catalog for storing data. We use a separate storage account to store the catalogs. We need to enable the option "Infrastructure encryption" on this storage account, and unfortunately this is only possible during creation of a storage account.
Our plan is:
I did notice that the modification date of the files on the storage account changes to the current date after the copy activity. Is this a problem for Databricks, or is this file date not used because the lineage is stored in Databricks tables?
I'm an infra person and not a data engineer.
Regards Marco
yesterday
Hi Marco
Are you using managed or external tables?
This should be easy to see either in Unity Catalog, or by checking in your storage account whether the folders are ID based or have actual table names (e.g. dim_customer or something).
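If it helps, here is a minimal sketch (run from a Databricks notebook; the catalog name is a placeholder) that lists the table type per table via Unity Catalog's information_schema, so you can see which ones are MANAGED and which are EXTERNAL:

# Minimal sketch: list table types via Unity Catalog's information_schema.
# 'my_catalog' is a placeholder for your actual catalog name.
tables = spark.sql("""
    SELECT table_catalog, table_schema, table_name, table_type
    FROM my_catalog.information_schema.tables
    ORDER BY table_schema, table_name
""")
display(tables)  # table_type shows MANAGED or EXTERNAL for each table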
Either way I would move the data to the temporary storage layer using a deep clone. This ensures you keep all the internal references such as the Delta log in Unity Catalog; those will break if you simply copy the files in storage directly.
Since the solution is not that big, the cost won't be significant.
https://docs.databricks.com/aws/en/delta/clone
The main difference is the cleanup: for managed tables you can simply drop the tables/schemas in UC, while for external tables you also need to go and physically delete the files in the storage account.
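To make the deep clone part concrete, a rough sketch could look like this (the catalog, schema and table names are placeholders, not your actual ones):

# Rough sketch: deep clone the managed tables of one schema into a temporary catalog.
# 'main_catalog', 'temp_catalog' and 'sales' are placeholder names.
table_names = [
    r.table_name
    for r in spark.sql("""
        SELECT table_name
        FROM main_catalog.information_schema.tables
        WHERE table_schema = 'sales' AND table_type = 'MANAGED'
    """).collect()
]

for name in table_names:
    # DEEP CLONE copies the data files and metadata of the source table to the target.
    spark.sql(f"""
        CREATE OR REPLACE TABLE temp_catalog.sales.{name}
        DEEP CLONE main_catalog.sales.{name}
    """)

Once the new storage account is in place you would clone the tables back the same way, and then do the cleanup described above.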
yesterday - last edited yesterday
Hi Marco,
Haha, I actually went through a similar case 🙂
If your UC storage is Databricks managed, it won't work the way you think. Copying blobs or containers will be a waste of time as far as I know; or put the other way, Databricks won't recognize the data inside automatically. Databricks-managed Unity Catalog storage has "directories" like:
<container>/
└── __unitystorage/
    └── catalogs/
        └── <catalog-uuid>/
            └── schemas/
                └── <schema-uuid>/
                    └── tables/
                        └── <table-uuid>/
                            ├── _delta_log/
                            └── *.parquet

Catalogs and schemas are actually UUID based, and this is the problematic part. So when you create an external location pointing to the UC root container, it won't recognize it; it will skip it and basically look like it's empty...
This means that you will need a pipeline that loads this data, using WASBS or a UC-enabled external location, back into a "fresh" managed catalog.
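As an illustration only (the storage account, container and target names below are made-up placeholders, and the path must be covered by an external location or storage credential), such a reload could deep clone the old Delta folders by path into the new managed catalog:

# Rough sketch: re-register one table from the copied container into a fresh managed catalog.
# The abfss path and the target catalog/schema/table names are placeholders.
old_path = (
    "abfss://old-container@oldstorageaccount.dfs.core.windows.net/"
    "__unitystorage/catalogs/<catalog-uuid>/schemas/<schema-uuid>/tables/<table-uuid>"
)

# Deep clone straight from the Delta folder path into a new managed table.
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS new_catalog.new_schema.my_table
    DEEP CLONE delta.`{old_path}`
""")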
I was actually working on a tool to solve this specific issue, to copy this data nicely... If you'd like, I'm happy to share it with you via GitHub, for example. I tested it locally, but I would be more than happy to have someone test it out in practice. It's deep clone based too.
If the catalog is managed outside of Databricks, then it's a bit easier I'd say.
Let me know if you have any questions or need support.
10 hours ago
Thanks for the replies 🙂
We are using both managed and external tables.
Our environment is built up with 3 storage accounts:
Is the plan below one that should work?
Regards,
Marco
10 hours ago
@Marco37 Yeah, that sounds like the plan I would follow.
... and then of course remember to pause any jobs/workflows and other things running on top before you get going - and restart them once you are done!
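If you want to script that step, a rough sketch using the Jobs 2.1 REST API could look like this (the host, token and job ID are placeholders; the same call with "UNPAUSED" restores the schedule afterwards):

import requests

# Rough sketch: pause a scheduled job before the migration, restore it afterwards.
# Host, token and job ID are placeholders for your own values.
DATABRICKS_HOST = "https://adb-<workspace-id>.azuredatabricks.net"
TOKEN = "<personal-access-token>"
JOB_ID = 123
headers = {"Authorization": f"Bearer {TOKEN}"}

# Fetch the current settings, flip the schedule to PAUSED and write it back.
job = requests.get(
    f"{DATABRICKS_HOST}/api/2.1/jobs/get",
    headers=headers,
    params={"job_id": JOB_ID},
).json()
schedule = job["settings"].get("schedule")
if schedule:
    schedule["pause_status"] = "PAUSED"  # use "UNPAUSED" once the migration is done
    requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/update",
        headers=headers,
        json={"job_id": JOB_ID, "new_settings": {"schedule": schedule}},
    ).raise_for_status()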
9 hours ago
@Marco37 Yes, overall that sounds good.
Just remember some points:
I'm happy to discuss more details; actually, I'm really interested in migration topics like this 🙂
I guess each container you show represents a catalog?