11-07-2022 08:37 AM
Hello all,
I'm currently trying to move the tables contained in one Azure Databricks workspace to another, because of a change in the way we use our resource groups. I have not been able to move more than metadata with the databrickslabs/migrate repo. I was wondering if I could copy data from within a notebook, but the JDBC connection keeps telling me to check the driver.
Is there any process to move data between workspaces?
11-07-2022 02:01 PM
Hi @Quentin Maire ,
We need a bit more detail:
where is your data stored?
are you using external or managed tables?
The migrate tool allows you to export DDL statements, not the data itself.
I can think of a few scenarios off the top of my head:
If you previously had external tables, you can create tables in the new workspace using the same ADLS path; that will let you access the data (see the sketch after the docs link below).
If you used external tables but need a new location for them (storage account, etc.), you can copy the data with Azure-native tools like AzCopy to the new location, then create external tables using the new location.
I think the above should work for managed tables as well; you just need to find out where the data is stored (in the DBFS root location). You can then copy the data into a new location and create external tables.
Are you planning to use Unity Catalog? You can then create a storage credential / external location using the existing table location and copy the data into managed or external tables.
There are a few options here, but if you used managed tables then most likely a copy is needed in your case, if the data storage needs to change as well. The copy can be done with AzCopy or some of the Databricks options.
Please see here for how to ingest data:
https://docs.databricks.com/ingestion/index.html
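For the first scenario above, a minimal sketch of re-creating an external table over the same ADLS path in the new workspace (the database, table, and path names are placeholders, and it assumes the existing tables are stored as Delta):
-- in the new workspace, assuming it can reach the same storage account
CREATE DATABASE IF NOT EXISTS my_database;

-- registers the existing Delta data; the schema is read from the Delta log at that path
CREATE TABLE my_database.my_table_1
USING DELTA
LOCATION 'abfss://<container>@<storage-account>.dfs.core.windows.net/path/to/my_table_1';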
thanks,
Pat
11-08-2022 02:52 AM
Hey Pat, thanks a lot for your quick answer. I'm indeed using managed tables and would like to move the whole DB across workspaces. There is a storage account related to the workspace, but I can see several containers in it: ephemeral, jobs, logs, meta and root. Is it safe to copy them from one storage account to another, overwriting the destination's containers? Which of these containers holds the Hive data?
Thanks a lot,
Quentin
11-08-2022 03:00 AM
Hi @Quentin Maire,
I didn't work much with managed tables until now, and with Unity Catalog it is a little bit different.
I do think the data is stored in the `root` container; you should be able to see something like the Hive table structure there:
root@something:
    table_1/part=1
    table_1/part=2
I would have to dig deeper, but I would test the data copy on one table and see.
The other containers (logs, etc.) are workspace related; I don't think they are needed when migrating.
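If it helps, one way to check where a given table actually lives before testing the copy (a sketch only; my_database.my_table_1 is a placeholder name):
-- shows Location, Provider (delta, parquet, ...) and whether the table is MANAGED or EXTERNAL
DESCRIBE TABLE EXTENDED my_database.my_table_1;

-- for Delta tables, this also returns the storage location and size
DESCRIBE DETAIL my_database.my_table_1;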
thanks,
Pat.
11-09-2022 01:20 AM
Hello Kaniz, there seems to be a Databricks system deny assignment on the IAM of the Azure storage account for the original workspace. How can I get the wasb path of the existing tables in the existing workspace without having access to the blob storage, and then get access to them?
11-09-2022 01:31 AM
Hi @Quentin Maire ,
you can try this statement:
SHOW CREATE DATABASE `database-name`;
It should give you the `location`, which is the default location for managed tables in that database.
It might be that your data is not stored in the `root` container; I forgot this part.
Basically, when you create a database you can specify a location, and it will be used for the managed tables created in that database.
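If SHOW CREATE DATABASE is not supported on your runtime, the following should return the same location information (sketch only; my_database is a placeholder):
-- lists the database's properties, including its default Location
DESCRIBE DATABASE EXTENDED my_database;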
11-09-2022 01:45 AM
SHOW CREATE DATABASE raises a ParseException, but with DESCRIBE DATABASE I was able to find that the DB location is dbfs:/user/hive/warehouse. I'm not sure the SO thread proposed above works then, as the data is not accessible from outside. Is there another solution?
11-09-2022 02:10 AM
@Quentin Maire, if you cannot access the data from outside, you will have to migrate it from inside.
If the data is stored in the root container and is not accessible from outside (I think you should be able to make this data accessible with Azure policies, but I don't know how to do it right now), the option is to create a separate location (storage account, container), create external tables there, and migrate the data 1:1. For example, if you have my_database.my_table_1, then in the existing workspace you attach the new storage and migrate the data:
CREATE EXTERNAL TABLE my_database_ext.my_table_1
(
  col_1 INT,
  col_2 STRING
)
LOCATION 'abfss://some-location/my_table_1';

INSERT INTO my_database_ext.my_table_1
SELECT * FROM my_database.my_table_1;
Then in your new workspace, either you use the external location as is:
CREATE EXTERNAL TABLE my_database.my_table_1
(
  col_1 INT,
  col_2 STRING
)
LOCATION 'abfss://some-location/my_table_1';
or you add an extra step and copy the data from these external tables into managed ones (see the sketch at the end of this post).
Either way, it seems there is a need to migrate the data.
Ideally, if you could access the storage in the root container, you would create the external table on top of it. Then in the new workspace:
INSERT INTO table_1
SELECT * FROM ext_table_1;
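For the extra step of turning those external tables back into managed ones in the new workspace, a minimal sketch (names are placeholders, and it assumes the data at the external location is Delta):
-- register the external data in the new workspace
CREATE TABLE my_database_ext.my_table_1
USING DELTA
LOCATION 'abfss://some-location/my_table_1';

-- copy it into a managed table (no LOCATION clause, so the table is managed)
CREATE TABLE my_database.my_table_1
AS SELECT * FROM my_database_ext.my_table_1;

-- once the copy is verified, the external reference can be dropped
-- DROP TABLE my_database_ext.my_table_1;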