Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to perform a deep clone for data migration from one Data Lake to another?

Coders
New Contributor II

 

I'm attempting to migrate data from Azure Data Lake to S3 using deep clone. The data in the source Data Lake is stored in Parquet format and partitioned. I've tried to follow the documentation from Databricks, which suggests that I need to register Parquet tables with partitions in a catalog such as the Hive metastore before cloning. However, I encountered an error while attempting to execute the clone command.
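As I understand the docs, registering the partitioned source table would look something like this; the database name and columns below are placeholders for my actual schema:

-- Register the partitioned Parquet source in the metastore (illustrative names)
CREATE TABLE IF NOT EXISTS source_db.customerorder (
  column1 STRING,
  column2 STRING,
  batchid STRING
)
USING PARQUET
PARTITIONED BY (batchid)
LOCATION 'sourcePath/customerorder/';

-- Discover the existing batchid=... partition directories
MSCK REPAIR TABLE source_db.customerorder;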

First, I created a table in the target Data Lake with the following SQL statement:

CREATE TABLE IF NOT EXISTS customerOrderArchive (
column1 datatype,
column2 datatype,
...
)
PARTITIONED BY (batchid string)
LOCATION 'targetPath/temp';

This table was successfully created in the catalog.

Then, I attempted to execute the clone command as follows:

CREATE OR REPLACE TABLE cdf.customerOrderArchive
CLONE parquet.`sourcePath/customerorder/`

However, this resulted in the following error:


SparkException: Job aborted due to stage failure: Task 5 in stage 381.0 failed 4 times, most recent failure: Lost task 5.3 in stage 381.0 (TID 2488) (10.139.64.15 executor 0): com.databricks.sql.transaction.tahoe.DeltaAnalysisException: Expecting 0 partition column(s): [], but found 1 partition column(s): [`batchid`] from parsing the file name: abfs://pathToSource/batchid=20240123235612633/part-00000-tid-792802563625641899-725c22de-6f96-439a-927b-f4e70648de3a-42-1.c000.snappy.parquet

Despite creating and registering the table in the Spark metadata catalog, I'm still encountering this error.
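One thing I notice is that my clone command references the raw path (parquet.`sourcePath/customerorder/`) rather than a registered table, so the partition metadata in the catalog may never be picked up. If the source were registered as sketched above, I'd expect the clone to be issued against the table name instead; something like this, where source_db is a placeholder:

-- Clone from the registered table so partitioning comes from the catalog
CREATE OR REPLACE TABLE cdf.customerOrderArchive
CLONE source_db.customerorder;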

Can anyone provide insight into what might be causing this issue? Additionally, is it possible to migrate data from one Data Lake to another using deep clone?

 

1 REPLY

Coders
New Contributor II

Hi Kaniz,

Thank you for the response. I'm pretty sure the source table is partitioned by batchid, and in the target data lake I have registered the table using the same partition column, batchid, as explained in the question.

So I'm not sure why it's throwing an error when I try to clone.

Another question I have is: how do I provide an external table location while cloning the data? I couldn't see that in the syntax. Is it because clone cannot be used to copy data from one data lake to another?
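From what I can tell in the clone syntax reference, CREATE TABLE ... CLONE can take an optional LOCATION clause to create the target as an external table; something like the following, where the S3 path is a placeholder. Can anyone confirm this works across data lakes?

-- Deep clone into an external S3 location (path is a placeholder)
CREATE OR REPLACE TABLE cdf.customerOrderArchive
DEEP CLONE source_db.customerorder
LOCATION 's3://my-bucket/customerOrderArchive/';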

 
