cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

How to do perform deep clone for data migration from one Datalake to another?

Coders
New Contributor II

 

I'm attempting to migrate data from Azure Data Lake to S3 using deep clone. The data in the source Data Lake is stored in Parquet format and partitioned. I've tried to follow the documentation from Databricks, which suggests that I need to register Parquet tables with partitions in a catalog such as the Hive metastore before cloning. However, I encountered an error while attempting to execute the clone command.

First, I created a table in the target Data Lake with the following SQL statement:

CREATE TABLE IF NOT EXISTS customerOrderArchive (
column1 datatype,
column2 datatype,
...
)
PARTITIONED BY (batchid string)
LOCATION 'targetPath/temp';

This table was successfully created in the catalog.

Then, I attempted to execute the clone command as follows:

CREATE OR REPLACE TABLE cdf.customerOrderArchive
CLONE parquet.`sourcePath/customerorder/`

However, this resulted in the following error:


SparkException: Job aborted due to stage failure: Task 5 in stage 381.0 failed 4 times, most recent failure: Lost task 5.3 in stage 381.0 (TID 2488) (10.139.64.15 executor 0): com.databricks.sql.transaction.tahoe.DeltaAnalysisException: Expecting 0 partition column(s): [], but found 1 partition column(s): [`batchid`] from parsing the file name: abfs://pathToSource/batchid=20240123235612633/part-00000-tid-792802563625641899-725c22de-6f96-439a-927b-f4e70648de3a-42-1.c000.snappy.parquet

Despite creating and registering the table in the Spark metadata catalog, I'm still encountering this error.

Can anyone provide insight into what might be causing this issue? Additionally, is it possible to migrate data from one Data Lake to another using deep clone?

 

2 REPLIES 2

Kaniz_Fatma
Community Manager
Community Manager

Hi @CodersIt appears that you’re encountering an issue while attempting to migrate data from Azure data lake to S3 using deep clone.

Let’s break down the problem and explore potential solutions.

  1. Error Explanation: The error message you received indicates that the deep clone operation is encountering an issue related to partition columns. Specifically, it states: “Expecting 0 partition column(s): [], but found 1 partition column(s): [batchid] from parsing the file name.”

  2. Partition Columns:

    • In your target table (customerOrderArchive), you’ve defined a partition column named batchid.
    • When performing the deep clone, the system expects the source data files to match the partitioning structure of the target table.
    • However, it seems that the source Parquet files have a different partitioning scheme (specifically, they include a batchid partition).
  3. Possible Solutions:

    • Verify Partitioning:
      • Double-check the partitioning structure of the source Parquet files. Ensure that they match the partitioning columns defined in your target table.
      • If the source files have a different partitioning scheme, consider adjusting either the source data or the target table to align them.
    • Shallow Clone:
      • Instead of deep cloning, try using a shallow clone. Shallow clones do not copy data files; they only create metadata entries in the catalog.
      • Shallow clones are useful for maintaining sync between a target table and a production source table that receives appends, updates, and deletes.
    • Delta Lake Conversion:
      • Consider converting your source Parquet data to Delta Lake format before cloning.
      • Delta Lake provides ACID transactions, schema evolution, and other features that can simplify data management.
    • Check Databricks Runtime Version:
      • Ensure that you’re using Databricks Runtime 11.3 or above, as this feature requires it.
    • Review Documentation:
      • Refer to the official Databricks documentation for detailed guidance on deep cloning and its limitations.
  4. Migrating Data from One Data Lake to Another:

    • Yes, it’s possible to migrate data from one Data Lake to another using deep clone.
    • Deep clone allows you to incrementally convert data from Parquet or Iceberg data sources to managed or external Delta tables.
    • You can use it for migration, syncing, and creating ACID-compliant snapshots.

Remember to thoroughly review the documentation and ensure that your source and target tables are correctly configured.

 

Coders
New Contributor II

Hi Kaniz,

Thank you for the response. I'm pretty sure that source table is partitioned by batchid and in the target datalake I have registered the table using the same partition batchid as explained in the question. 

So I'm not sure why it's throwing an error when I try to clone.

Another question I have is, how do I provide external table location while cloning the data? I see that's syntax doesn't support that. Is it because clone cannot be used to copy data from one datalake to another?

 

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group