Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Issue with DeltaFileNotFoundException After Vacuum and Missing Data Changes in Delta Log

minhhung0507
Contributor

Dear Databricks experts,

I encountered the following error in Databricks:

`com.databricks.sql.transaction.tahoe.DeltaFileNotFoundException: [DELTA_EMPTY_DIRECTORY] No file found in the directory: gs://cimb-prod-lakehouse/bronze-layer/losdb/pl_message/_delta_log.`

This issue occurred after running a **Vacuum** operation. Despite continuous data ingestion, I noticed that there were no changes reflected in the Delta log (`_delta_log`). This raises a few questions:

1. Why does the **Vacuum** operation delete essential files, such as those required for `_delta_log`, leading to this error?
2. How can data ingestion continue without updates being recorded in the Delta log?
3. Is there a way to ensure that necessary files are retained during Vacuum to avoid such issues?

Currently, I have managed to work around the issue by identifying the last valid version after the Vacuum process and reading from that version. Since I am using readChangeFeed, I can read from the latest version if a new issue arises. However, I would like to better understand the root cause and how to prevent this problem in the future.
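
For reference, a minimal sketch of that workaround in Databricks SQL (the table name and starting version are placeholders, and it assumes the change data feed is enabled on the table):

DESCRIBE HISTORY my_table; -- find the last version that is still readable after the vacuum
SELECT * FROM table_changes('my_table', 123); -- resume reading the change feed from that version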

Thank you for your guidance!


Hung Nguyen

6 REPLIES

saurabh18cs
Valued Contributor III

The error you're encountering, com.databricks.sql.transaction.tahoe.DeltaFileNotFoundException: [DELTA_EMPTY_DIRECTORY] No file found in the directory: gs://cimb-prod-lakehouse/bronze-layer/losdb/pl_message/_delta_log, indicates that the _delta_log directory is empty or missing, which is critical for Delta Lake operations. This issue can arise due to improper use of the VACUUM operation.

  • The VACUUM operation in Delta Lake is used to remove old files that are no longer needed for the current state of the table. However, if the retention period is set too short, it can inadvertently delete files that are still needed for the Delta table's metadata and transaction log.
  • The default retention period for VACUUM is 7 days. If you set a shorter retention period, you risk deleting files that are still required.
  • If the _delta_log directory is missing or corrupted, Delta Lake cannot properly record transactions. This can lead to inconsistencies and errors during data ingestion and querying.

VACUUM my_table RETAIN 168 HOURS; -- Retain files for 7 days
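
If it helps, the retention can also be pinned at the table level with standard Delta table properties (a minimal sketch; my_table and the intervals are placeholders):

ALTER TABLE my_table SET TBLPROPERTIES (
  'delta.deletedFileRetentionDuration' = 'interval 7 days', -- how long VACUUM keeps unreferenced data files
  'delta.logRetentionDuration' = 'interval 30 days' -- how long commit history is kept in _delta_log
);
SHOW TBLPROPERTIES my_table; -- verify the settings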

minhhung0507
Contributor

Hi @saurabh18cs , 

Thank you for your explanation regarding the VACUUM operation and the error I encountered. I appreciate your insights.

I would like to clarify further: why does the VACUUM feature sometimes delete files that are still necessary and being referenced? Is this behavior considered a bug, or is it an inherent aspect of how the VACUUM operation functions? Understanding this will help me better manage the retention period and prevent future issues.

Hi @VZLA , I would appreciate it if you could let me know your thoughts on this matter.

Thank you for your assistance!

Hung Nguyen

Hi @minhhung0507 ,

You must choose a retention interval that is longer than both the longest-running concurrent transaction and the longest period by which any stream can lag behind the most recent update to the table. Otherwise, a table can become corrupted when VACUUM deletes files that have not yet been committed or are still being read.

There is also a safety check that verifies no operations on the table run longer than the retention interval you plan to specify. You can turn this check on or off with the Spark configuration property spark.databricks.delta.retentionDurationCheck.enabled (set it to false to disable it).
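
A minimal sketch of that sequence (the 48-hour value and my_table are placeholders; only disable the check if you are certain nothing reads or writes the table for longer than the chosen interval):

SET spark.databricks.delta.retentionDurationCheck.enabled = false; -- allow a retention shorter than the 7-day default
VACUUM my_table RETAIN 48 HOURS; -- remove unreferenced data files older than 48 hours
SET spark.databricks.delta.retentionDurationCheck.enabled = true; -- re-enable the safety check afterwards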

Hope this helps!!!

saurabh18cs
Valued Contributor III

Hi @minhhung0507 

This behavior is not a bug but rather an inherent aspect of how the VACUUM operation functions. VACUUM does not delete anything from the _delta_log folder; that folder has its own default retention of 30 days:

  • Delta Lake maintains a transaction log (_delta_log directory) that records all changes to the table. This log ensures ACID transactions and allows for time travel and versioning.
  • The transaction log contains metadata about the files that make up the table at any given point in time.

So it is up to you to decide how much time travel and versioning you want for your data. Data files take up storage space, so keeping the default 7-day retention is a good baseline: anything longer incurs extra storage cost, while anything shorter increases the risk of having too little history available. Matching the 30-day retention of _delta_log is also reasonable, but cost and your use case should drive the choice.
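
As an illustration of what that retention buys you, older versions remain queryable within the window (a minimal sketch; the table name, version number, and timestamp are placeholders):

DESCRIBE HISTORY my_table; -- list the versions still recorded in _delta_log
SELECT * FROM my_table VERSION AS OF 42; -- time travel to a specific version
SELECT * FROM my_table TIMESTAMP AS OF '2025-01-10'; -- or to a point in time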

Hi @saurabh18cs ,

Thanks for that very detailed explanation. I will take note and continue to observe this case.

Hung Nguyen

hari-prasad
Valued Contributor II

Hi @minhhung0507,

The VACUUM command on a Delta table does not delete the _delta_log folder, as this folder contains all the metadata related to the Delta table. The _delta_log folder acts as a pointer where all changes are tracked. In the event that the _delta_log folder is accidentally deleted, it cannot be recovered unless bucket versioning is enabled. If versioning is enabled, you can restore the deleted files and run the FSCK REPAIR command to fix the Delta table. However, it's important to understand how Delta performs the FSCK operation under the hood.

For more details on VACUUM, refer to the following link: VACUUM | Databricks on Google Cloud

 

If you are still facing issues querying the table because of missing Parquet files, you can fix them by running the following command; see FSCK REPAIR TABLE | Databricks on Google Cloud:

FSCK REPAIR TABLE table_name [DRY RUN]
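
For example, a minimal sketch (my_table is a placeholder; DRY RUN only previews what would be removed from the transaction log):

FSCK REPAIR TABLE my_table DRY RUN; -- preview file entries that can no longer be found in storage
FSCK REPAIR TABLE my_table; -- remove those entries so the table can be queried again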

 

Regards,
Hari Prasad

 


