How to do perform deep clone for data migration from one Datalake to another?

Coders — Tue, 12 Mar 2024 19:17:26 GMT

I'm attempting to migrate data from Azure Data Lake to S3 using deep clone. The data in the source Data Lake is stored in Parquet format and partitioned. I've tried to follow the documentation from Databricks, which suggests that I need to register Parquet tables with partitions in a catalog such as the Hive metastore before cloning. However, I encountered an error while attempting to execute the clone command.

First, I created a table in the target Data Lake with the following SQL statement:

CREATE TABLE IF NOT EXISTS customerOrderArchive (
column1 datatype,
column2 datatype,
...
)
PARTITIONED BY (batchid string)
LOCATION 'targetPath/temp';

This table was successfully created in the catalog.

Then, I attempted to execute the clone command as follows:

CREATE OR REPLACE TABLE cdf.customerOrderArchive
CLONE parquet.`sourcePath/customerorder/`

However, this resulted in the following error:

SparkException: Job aborted due to stage failure: Task 5 in stage 381.0 failed 4 times, most recent failure: Lost task 5.3 in stage 381.0 (TID 2488) (10.139.64.15 executor 0): com.databricks.sql.transaction.tahoe.DeltaAnalysisException: Expecting 0 partition column(s): [], but found 1 partition column(s): [`batchid`] from parsing the file name: abfs://pathToSource/batchid=20240123235612633/part-00000-tid-792802563625641899-725c22de-6f96-439a-927b-f4e70648de3a-42-1.c000.snappy.parquet

Despite creating and registering the table in the Spark metadata catalog, I'm still encountering this error.

Can anyone provide insight into what might be causing this issue? Additionally, is it possible to migrate data from one Data Lake to another using deep clone?

Re: How to do perform deep clone for data migration from one Datalake to another?

Coders — Wed, 13 Mar 2024 13:48:05 GMT

Hi Kaniz,

Thank you for the response. I'm pretty sure that source table is partitioned by batchid and in the target datalake I have registered the table using the same partition batchid as explained in the question.

So I'm not sure why it's throwing an error when I try to clone.

Another question I have is, how do I provide external table location while cloning the data? I see that's syntax doesn't support that. Is it because clone cannot be used to copy data from one datalake to another?

topic How to do perform deep clone for data migration from one Datalake to another? in Data Engineering

How to do perform deep clone for data migration from one Datalake to another?

Re: How to do perform deep clone for data migration from one Datalake to another?