Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to Implement the Lifecycle of Data When Using ADLS

SebastianCar28
New Contributor

Hello everyone, nice to greet you. I have a question about the data lifecycle in ADLS. I know ADLS has its own lifecycle-management rules, but they aren't working for me because I have two ADLS accounts: one for hot data and another for cool storage where the information is archived. I have external-location connections between ADLS and Databricks, and data lands as Parquet files plus the Delta log. However, when I try to copy the data using ADF and then empty my HOT Delta lakehouse, everything fails because of the Delta log. Is there a way to copy the data from the HOT ADLS to the COOL ADLS so I can empty the HOT account while continuing to produce new data, archiving it in COOL ADLS for later use when needed?

1 REPLY

mark_ott
Databricks Employee

Yes, you can move data from your HOT ADLS account to a COOL ADLS account while handling Delta Lake log issues, but it requires special techniques because of Delta Lake's transaction log. The problem stems from Delta tables' dependency on the Delta log (the _delta_log directory) for consistency and transactional integrity: simply copying the Parquet files isn't sufficient, because operations like DELETE and UPDATE are recorded in the log rather than applied directly to the data files.

Strategies for Migrating Delta Lake Data

1. Use Databricks "OPTIMIZE" and "VACUUM" Before Archiving

  • Before migrating, run OPTIMIZE on your Delta table in HOT ADLS to compact small files, then VACUUM to remove stale files no longer referenced by the log. This reduces the data footprint you need to copy while preserving transactional integrity.

  • This step improves efficiency when copying data, but you still must preserve the _delta_log directory.
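As a sketch, the maintenance step boils down to two SQL statements run via spark.sql. The table name below is a hypothetical placeholder:

```python
# Hedged sketch of the pre-archive maintenance step. The table name is a
# hypothetical placeholder; on Databricks each statement runs via spark.sql.
def maintenance_statements(table: str, retain_hours: int = 168) -> list:
    """Build the OPTIMIZE and VACUUM statements to run before archiving.

    VACUUM's retention must be at least the configured minimum
    (by default 168 hours / 7 days) unless the safety check is disabled.
    """
    return [
        f"OPTIMIZE {table}",
        f"VACUUM {table} RETAIN {retain_hours} HOURS",
    ]

# In a Databricks notebook (requires a live SparkSession):
# for stmt in maintenance_statements("hot_catalog.sales.orders"):
#     spark.sql(stmt)
```

Note that VACUUM permanently removes files needed for time travel beyond the retention window, so run it only once you no longer need that history on the HOT side.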

2. Copying Entire Delta Table (Including _delta_log)

  • To maintain Delta Lake integrity, you must copy both the Parquet files and the entire _delta_log directory from HOT to COOL ADLS.

  • You can use Azure Data Factory's (ADF) Copy Data activity or an Azure Databricks notebook with tools like dbutils.fs.cp.

  • Make sure to keep folder/file permissions and structure intact, as this is crucial for Delta Lake to recognize tables in COOL ADLS.
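A minimal sketch of the layout-preserving mapping, assuming hypothetical account, container, and path names; the single recursive dbutils.fs.cp call at the end is how the whole-table copy would run in a Databricks notebook:

```python
# Hedged sketch of a recursive HOT -> COOL copy that keeps the table layout,
# including _delta_log/. Account, container, and path names are hypothetical.
HOT_ROOT = "abfss://data@hotaccount.dfs.core.windows.net/delta/orders"
COOL_ROOT = "abfss://archive@coolaccount.dfs.core.windows.net/delta/orders"

def to_cool_path(hot_path: str) -> str:
    """Map a path under the HOT table root to its COOL counterpart,
    preserving the relative structure Delta depends on."""
    if not hot_path.startswith(HOT_ROOT):
        raise ValueError(f"{hot_path} is outside {HOT_ROOT}")
    return COOL_ROOT + hot_path[len(HOT_ROOT):]

# In a Databricks notebook the whole-table copy is one recursive call
# (dbutils is only available inside Databricks):
# dbutils.fs.cp(HOT_ROOT, COOL_ROOT, recurse=True)
```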

3. Write the Table Out as Plain Parquet for Pure Historical Storage

  • If COOL storage only needs archived snapshots and not active Delta tables, export the data as standalone Parquet files using Databricks (via .write.format("parquet")).

  • In this case, you lose Delta transactional features but simplify storage/retrieval.

  • This approach is useful for deep archive, but for downstream Delta processing you should copy the full Delta layout.
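If plain-Parquet archiving fits, the export can be sketched as below. The URIs are hypothetical, and the Spark calls are commented out because they need a live session:

```python
# Hedged sketch of a Delta -> plain-Parquet export for deep archive.
# Writing without the _delta_log drops time travel and ACID history.
def export_plan(src_uri: str, dst_uri: str) -> dict:
    """Describe the export: read the table as Delta, rewrite it as
    standalone Parquet files at the COOL destination."""
    return {
        "read_format": "delta", "read_path": src_uri,
        "write_format": "parquet", "write_mode": "overwrite",
        "write_path": dst_uri,
    }

# On Databricks (hypothetical URIs):
# plan = export_plan(hot_uri, cool_uri)
# df = spark.read.format(plan["read_format"]).load(plan["read_path"])
# (df.write.format(plan["write_format"])
#    .mode(plan["write_mode"]).save(plan["write_path"]))
```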

4. Automate the Workflow with ADF or Databricks Notebooks

  • Use ADF pipelines to run Databricks notebooks that handle Delta operations and copying, or orchestrate through scheduled jobs.

  • For complex lifecycles, consider the open-source Delta Lake archiving patterns for best practices.

Sample Workflow

  1. Use Databricks to OPTIMIZE and VACUUM tables in HOT ADLS.

  2. Deploy a notebook or ADF activity that recursively copies the full Delta folder (including _delta_log) to COOL ADLS.

  3. After copying, validate the Delta table in COOL ADLS by mounting it in Databricks and reading/querying it.

  4. Once validated, delete or truncate HOT ADLS data for new data cycles.
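The four steps above can be sketched as an ordered plan. All names are hypothetical placeholders, and each action maps to a spark.sql or dbutils.fs call on Databricks; the ordering is the important part, since HOT data must only be emptied after the COOL copy validates:

```python
# Hedged sketch of one HOT -> COOL archive cycle. Table and path names are
# hypothetical; each action would be executed via spark.sql / dbutils.fs.
def archive_cycle(table: str, hot_root: str, cool_root: str) -> list:
    """Order matters: maintain, copy, validate, and only then empty HOT."""
    return [
        f"OPTIMIZE {table}",
        f"VACUUM {table} RETAIN 168 HOURS",
        f"COPY {hot_root} -> {cool_root} (recursive, including _delta_log)",
        f"VALIDATE: read {cool_root} as a Delta table and query it",
        f"DELETE FROM {table}",
    ]

steps = archive_cycle("hot_catalog.sales.orders",
                      "abfss://data@hot.dfs.core.windows.net/orders",
                      "abfss://archive@cool.dfs.core.windows.net/orders")
```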

Best Practices

  • Always test the restored/archived Delta tables in COOL ADLS before deleting the HOT copy.

  • Ensure you have permissions for deep copy in both ADLS accounts.

  • For continued new data production, automate periodic archiving using job scheduling in Databricks or ADF, keeping data lineage traceable.

Summary Table

Approach                         Delta Log Preserved   Ready for Restore   Efficient for Archive   Complexity
Full Delta Table Copy            Yes                   Yes                 No                      Medium
Export as Parquet                No                    No                  Yes                     Low
Delta "Clone" or Backup Script   Yes                   Yes                 Yes                     High
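For the clone approach, Databricks supports DEEP CLONE, which copies both the data files and the transaction log in a single command. A hedged sketch, with hypothetical table names and location URI:

```python
# Hedged sketch of a Databricks DEEP CLONE statement, which copies data
# files plus the _delta_log together. All names/URIs are hypothetical.
def deep_clone_statement(source: str, target: str, location: str) -> str:
    """Build the SQL that clones `source` into `target` at `location`."""
    return (f"CREATE OR REPLACE TABLE {target} "
            f"DEEP CLONE {source} "
            f"LOCATION '{location}'")

# On Databricks:
# spark.sql(deep_clone_statement(
#     "hot_catalog.sales.orders",
#     "cool_catalog.archive.orders",
#     "abfss://archive@coolaccount.dfs.core.windows.net/delta/orders"))
```

Re-running a DEEP CLONE is incremental, which makes it a convenient building block for scheduled archiving jobs.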
 
 

Using these methods, you can successfully manage the lifecycle of data between HOT and COOL ADLS accounts for your Delta Lake environments while maintaining transaction fidelity and efficient archive processes.