01-15-2025 03:20 AM
Dear Databricks experts,
I encountered the following error in Databricks:
`com.databricks.sql.transaction.tahoe.DeltaFileNotFoundException: [DELTA_EMPTY_DIRECTORY] No file found in the directory: gs://cimb-prod-lakehouse/bronze-layer/losdb/pl_message/_delta_log.`
This issue occurred after running a **Vacuum** operation. Despite continuous data ingestion, I noticed that there were no changes reflected in the Delta log (`_delta_log`). This raises a few questions:
1. Why does the **Vacuum** operation delete essential files, such as those required for `_delta_log`, leading to this error?
2. How can data ingestion continue without updates being recorded in the Delta log?
3. Is there a way to ensure that necessary files are retained during Vacuum to avoid such issues?
Currently, I have managed to fix the issue by identifying the last valid version after the VACUUM process and reading from that version. Since I am using readChangeFeed, I can read from the latest version if a new issue arises. However, I would like to better understand the root cause and how to prevent this problem in the future.
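For reference, a minimal sketch of this kind of recovery read (the version number is a placeholder, and the batch Change Data Feed read is assumed; a streaming read works similarly):

```python
# Rough sketch: read the Change Data Feed starting from the last version that survived VACUUM.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

last_valid_version = 100  # placeholder: substitute the last valid version you identified

changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_valid_version)
    .load("gs://cimb-prod-lakehouse/bronze-layer/losdb/pl_message")
)
changes.show()
```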
Thank you for your guidance!
Hung Nguyen
01-15-2025 05:23 AM
The error you're encountering, `com.databricks.sql.transaction.tahoe.DeltaFileNotFoundException: [DELTA_EMPTY_DIRECTORY] No file found in the directory: gs://cimb-prod-lakehouse/bronze-layer/losdb/pl_message/_delta_log`, indicates that the `_delta_log` directory is empty or missing, and that directory is critical for Delta Lake operations. This issue can arise due to improper use of the VACUUM operation.
- The VACUUM operation in Delta Lake is used to remove old files that are no longer needed for the current state of the table. However, if the retention period is set too short, it can inadvertently delete files that are still needed for the Delta table's metadata and transaction log.
- The default retention period for VACUUM is 7 days. If you set a shorter retention period, you risk deleting files that are still required.
- If the _delta_log directory is missing or corrupted, Delta Lake cannot properly record transactions. This can lead to inconsistencies and errors during data ingestion and querying.
To keep the default 7-day retention explicitly, you can run: `VACUUM my_table RETAIN 168 HOURS; -- Retain files for 7 days`
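As a rough illustration (assuming PySpark in a Databricks notebook and a placeholder table name `my_table`), you can preview what VACUUM would delete before committing to it:

```python
# `spark` is the SparkSession provided in a Databricks notebook.
# Preview up to 1000 files that VACUUM would delete, without removing anything.
spark.sql("VACUUM my_table RETAIN 168 HOURS DRY RUN").show(truncate=False)

# If the dry run looks safe, run the actual VACUUM with the same retention window.
spark.sql("VACUUM my_table RETAIN 168 HOURS")
```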
01-15-2025 07:41 PM
Hi @saurabh18cs ,
Thank you for your explanation regarding the VACUUM operation and the error I encountered. I appreciate your insights.
I would like to clarify further: why does the VACUUM feature sometimes delete files that are still necessary and being referenced? Is this behavior considered a bug, or is it an inherent aspect of how the VACUUM operation functions? Understanding this will help me better manage the retention period and prevent future issues.
Hi @VZLA , I would appreciate it if you could let me know your thoughts on this matter.
Thank you for your assistance!
Hung Nguyen
01-15-2025 07:59 PM
Hi @minhhung0507 ,
You must choose a retention interval that is longer than the longest-running concurrent transaction and the longest period that any stream can lag behind the most recent update to the table. Otherwise, the table can be corrupted when VACUUM deletes files that have not yet been committed or that are still being read.
There is also a safety check that verifies no operations on the table take longer than the retention interval you plan to specify; you can disable this check by setting the Spark configuration property spark.databricks.delta.retentionDurationCheck.enabled to false.
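For example, a hedged sketch of disabling that safety check for a single session (`my_table` is a placeholder; only do this when you are sure no concurrent readers or writers still need the older files):

```python
# `spark` is the SparkSession provided in a Databricks notebook.
# Disable the retention-duration safety check for this session only.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# A shorter-than-default retention is now accepted; 48 hours is purely illustrative.
spark.sql("VACUUM my_table RETAIN 48 HOURS")
```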
Hope this helps!!!
01-16-2025 02:02 AM (Accepted Solution)
This behavior is not a bug but an inherent aspect of how the VACUUM operation functions. VACUUM does not delete anything from the _delta_log folder; that folder has its own default retention of 30 days:
- Delta Lake maintains a transaction log (_delta_log directory) that records all changes to the table. This log ensures ACID transactions and allows for time travel and versioning.
- The transaction log contains metadata about the files that make up the table at any given point in time.
So it is up to you to decide how much time travel or versioning you want for your data. Because data files take up storage space, keeping the default 7-day retention is a good balance: anything longer incurs more storage cost, while anything shorter increases the risk of losing files that are still needed. Matching the 30-day retention of _delta_log is also reasonable, but cost and your use case should drive the decision.
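If you want to pin both retention windows explicitly, a minimal sketch of setting them as table properties (`my_table` is a placeholder; the values shown are just the defaults):

```python
# `spark` is the SparkSession provided in a Databricks notebook.
# delta.logRetentionDuration controls how long _delta_log history is kept (default 30 days).
# delta.deletedFileRetentionDuration is the threshold VACUUM uses for data files (default 7 days).
spark.sql("""
    ALTER TABLE my_table SET TBLPROPERTIES (
        'delta.logRetentionDuration' = 'interval 30 days',
        'delta.deletedFileRetentionDuration' = 'interval 7 days'
    )
""")
```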
01-16-2025 07:21 PM
Hi @saurabh18cs ,
Thanks for that very detailed explanation. I will take note and continue to observe this case.
Hung Nguyen
01-16-2025 01:36 AM
Hi @minhhung0507,
The VACUUM command on a Delta table does not delete the _delta_log folder, as this folder contains all the metadata related to the Delta table. The _delta_log folder acts as a pointer where all changes are tracked. If the _delta_log folder is accidentally deleted, it cannot be recovered unless bucket versioning is enabled. If versioning is enabled, you can restore the deleted files and run the FSCK REPAIR command to fix the Delta table. However, it's important to understand how Delta performs the FSCK operation under the hood.
For a deeper understanding of VACUUM, refer to the following link: VACUUM | Databricks on Google Cloud.
If you are still unable to query the table because of missing Parquet files, you can fix it by running the following command (see FSCK REPAIR TABLE | Databricks on Google Cloud):
`FSCK REPAIR TABLE table_name [DRY RUN]`
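A minimal sketch of running this from a notebook, with a dry run first (the table name is assumed from the path in the error and is only a placeholder; substitute your actual table name):

```python
# `spark` is the SparkSession provided in a Databricks notebook.
# List the missing file entries FSCK would remove from the Delta log, without changing anything.
spark.sql("FSCK REPAIR TABLE losdb.pl_message DRY RUN").show(truncate=False)

# Drop the references to the missing Parquet files so the table can be queried again.
spark.sql("FSCK REPAIR TABLE losdb.pl_message")
```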
Regards,
Hari Prasad