Azure Databricks, migrating delta table data with CDF on.

grazie
Contributor

We are on Azure Databricks over ADLS Gen2 and have a set of tables and workflows that process data from and between those tables using change data feeds. (We are not yet using Unity Catalog, nor the Hive metastore; we just access the Delta tables directly from storage.)

Now we are required by IT to move data to a different Azure storage account, and I am faced with a conundrum. Delta documentation states that it is important to maintain timestamps if you want to preserve CDF. https://docs.delta.io/latest/delta-faq.html#can-i-copy-my-delta-lake-table-to-another-location
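(For reference, consuming the change data feed from a path-based Delta table looks roughly like the sketch below; the abfss path and starting version are placeholders.)

    # Minimal sketch of reading CDF from a path-based Delta table (no metastore);
    # the abfss:// path and starting version are placeholders.
    changes = (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", 0)
        .load("abfss://container@account.dfs.core.windows.net/tables/my_table")
    )
    changes.select("_change_type", "_commit_version", "_commit_timestamp").show()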

On the other hand, an Azure support rep states that the only way they can think of is to do it manually with Storage Explorer, something like this:


In Storage Explorer:

- Export the table as CSV, with the option to export the timestamp.

- Import the CSV into the new storage account, renaming the Timestamp column to something else, like Timestamp2.

- This adds a new field to the new table and also consumes more space because of it.

I need to try that recipe to see if it actually makes sense, but in reality we have many tables and such a manual process would be inefficient and error-prone. 

Maybe someone here knows a better way? 

3 REPLIES

grazie
Contributor

As it turns out, due to a misunderstanding, the responses from Azure support were answering a slightly different question (about Azure Table Storage instead of Delta Tables on Blob/ADLS Gen2), so we'll try there again. However, still interested in ideas/experiences from this community. 

Also, I couldn't find a way to fix the original post, or to delete and recreate it without the confusing details that turned out to be about Azure Table Storage, so sorry about that.

Kaniz
Community Manager

Hi @grazie, moving data between Azure storage accounts while preserving timestamps and ensuring efficient processes can indeed be a challenge. Let’s explore some options to achieve this without resorting to manual, error-prone steps:

 

Azure Databricks Mount Points:

Azure Service Principal and OAuth 2.0:

Automated Data Migration:

  • For multiple tables, consider automating the data migration process using Azure native tools.
  • You can use Azure Data Factory, Azure Logic Apps, or Azure Functions to orchestrate the movement of data from the old storage account to the new one.
  • These tools allow you to maintain timestamps and minimize manual intervention.

Delta Lake Table Copy:

  • While preserving timestamps is crucial for Change Data Feeds (CDF), you can explore using Delta Lake’s built-in features.
  • Create a new Delta Lake table in the new storage account and use the COPY INTO command to copy data from the old table to the new one (a rough sketch follows below).
  • Delta Lake will handle schema evolution, including timestamp columns, efficiently.
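A rough syntax sketch of that COPY INTO idea (placeholder paths; the target Delta table must already be created, and COPY INTO ingests data files rather than the source table’s transaction log):

    # Rough sketch only -- placeholder abfss:// paths; the target Delta table
    # must already exist. This copies current data files, not table history.
    spark.sql("""
        COPY INTO delta.`abfss://data@newaccount.dfs.core.windows.net/tables/my_table`
        FROM 'abfss://data@oldaccount.dfs.core.windows.net/tables/my_table'
        FILEFORMAT = PARQUET
        PATTERN = '*.parquet'  -- adjust for partitioned layouts; keeps _delta_log checkpoint files out
    """)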

     Good luck! 🚀

grazie
Contributor

Thank you for a comprehensive reply. I have comments on each point below. For now, we have actually decided to put off the move, so it is not an imminent issue, though it will come back.

 

The basic issue is specific to Delta tables, and the fact that Delta relies on the timestamps of the files themselves (as opposed to timestamps contained in the file contents), which a plain copy between storage accounts does not necessarily preserve.

 

- Azure Databricks Mount Points
  - This only affects how data is accessed through Databricks, and doesn't deal with the actual moving of data between accounts.

 

- Azure Service Principal and OAuth 2.0
  - This is indeed how we connect to the storage. But again, this doesn't address the issue of moving data between accounts.

 

- Automated Data Migration
  - This addresses the data movement issue, and it could be interesting to learn more about. It all depends on whether "allow you to maintain timestamps and minimize manual intervention" is true for the specific timestamps that the Delta consumer depends on (a quick check like the sketch at the end of this list could verify that). If you have more specifics on this it would be helpful to anyone with a similar issue. I found the following statement, which seems to suggest this is not simple: https://learn.microsoft.com/en-us/answers/questions/656437/is-it-possible-to-edit-the-lastmodified-a...

 

- Delta Lake Table Copy
  - COPY INTO "Loads data from a file location into a Delta table" and accepts the formats CSV, JSON, AVRO, ORC, PARQUET, TEXT, and BINARYFILE. This does not seem to address the basic issue.
  - We were also considering deep clone, but according to Databricks documentation "A cloned table has an independent history from its source table. Time travel queries on a cloned table do not work with the same inputs as they work on its source table."
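For whichever migration mechanism we end up testing, a quick check like the sketch below could verify whether the file-level last-modified timestamps actually survived the copy (assuming the azure-storage-blob and azure-identity packages; account URLs, container name and table prefix are placeholders):

    # Minimal sketch: compare last-modified timestamps of a Delta table's files
    # (including _delta_log) between the old and new storage accounts.
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import ContainerClient

    cred = DefaultAzureCredential()
    old = ContainerClient("https://oldaccount.blob.core.windows.net", "data", credential=cred)
    new = ContainerClient("https://newaccount.blob.core.windows.net", "data", credential=cred)

    prefix = "tables/my_table/"  # placeholder table path
    old_ts = {b.name: b.last_modified for b in old.list_blobs(name_starts_with=prefix)}

    for blob in new.list_blobs(name_starts_with=prefix):
        src_ts = old_ts.get(blob.name)
        if src_ts is not None and src_ts != blob.last_modified:
            print(f"timestamp changed: {blob.name}: {src_ts} -> {blob.last_modified}")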