I'm attempting to migrate data from Azure Data Lake Storage to S3 using deep clone. The source data is partitioned and stored in Parquet format. I've tried to follow the Databricks documentation, which says that partitioned Parquet tables need to be registered in a catalog such as the Hive metastore before they can be cloned. However, the clone command fails with an error.
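My understanding of the registration step the docs describe is roughly the following (a sketch only; customerorder_source and the column names/types are hypothetical placeholders, and sourcePath stands in for the real ADLS path):

CREATE TABLE IF NOT EXISTS customerorder_source (
  order_id BIGINT,     -- placeholder columns; the real schema matches the Parquet files
  order_ts TIMESTAMP,
  batchid  STRING
)
USING PARQUET
PARTITIONED BY (batchid)
LOCATION 'sourcePath/customerorder/';

-- Pick up the existing batchid=... directories as partitions.
MSCK REPAIR TABLE customerorder_source;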
First, I created a table in the target Data Lake with the following SQL statement:
CREATE TABLE IF NOT EXISTS customerOrderArchive (
  column1 datatype,
  column2 datatype,
  ...
)
PARTITIONED BY (batchid string)
LOCATION 'targetPath/temp';
This table was successfully created in the catalog.
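To be clear about what I mean by "created in the catalog": the table is visible to catalog queries. The kind of check I have in mind is something like this (standard commands, nothing special):

DESCRIBE TABLE EXTENDED customerOrderArchive;  -- shows the table's location and partition column
SHOW PARTITIONS customerOrderArchive;          -- lists any registered batchid partitions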
Then, I attempted to execute the clone command as follows:
CREATE OR REPLACE TABLE cdf.customerOrderArchive
CLONE parquet.`sourcePath/customerorder/`
However, this resulted in the following error:
SparkException: Job aborted due to stage failure: Task 5 in stage 381.0 failed 4 times, most recent failure: Lost task 5.3 in stage 381.0 (TID 2488) (10.139.64.15 executor 0): com.databricks.sql.transaction.tahoe.DeltaAnalysisException: Expecting 0 partition column(s): [], but found 1 partition column(s): [`batchid`] from parsing the file name: abfs://pathToSource/batchid=20240123235612633/part-00000-tid-792802563625641899-725c22de-6f96-439a-927b-f4e70648de3a-42-1.c000.snappy.parquet
Despite creating and registering the table in the Spark metadata catalog, I'm still encountering this error.
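For what it's worth, I'm also unsure whether the clone is supposed to reference a registered source table by name rather than the raw Parquet path. My reading of the docs suggests something along these lines (a sketch only, assuming a source table registered as in the earlier example; I haven't confirmed this is the intended approach):

-- Sketch: deep clone the registered source table by name instead of by path.
CREATE OR REPLACE TABLE cdf.customerOrderArchive
DEEP CLONE customerorder_source
LOCATION 'targetPath/temp';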
Can anyone provide insight into what might be causing this issue? More generally, is it possible to migrate data from one data lake to another (ADLS to S3) using deep clone?