3 weeks ago
Dear Databricks experts,
I encountered the following error in Databricks:
`com.databricks.sql.transaction.tahoe.DeltaFileNotFoundException: [DELTA_EMPTY_DIRECTORY] No file found in the directory: gs://cimb-prod-lakehouse/bronze-layer/losdb/pl_message/_delta_log.`
This issue occurred after running a **Vacuum** operation. Despite continuous data ingestion, I noticed that there were no changes reflected in the Delta log (`_delta_log`). This raises a few questions:
1. Why does the **Vacuum** operation delete essential files, such as those required for `_delta_log`, leading to this error?
2. How can data ingestion continue without updates being recorded in the Delta log?
3. Is there a way to ensure that necessary files are retained during Vacuum to avoid such issues?
Currently, I have managed to work around the issue by identifying the last valid version after the Vacuum process and reading from that version. Since I am using readChangeFeed, I can resume from the latest valid version if a new issue arises. However, I would like to better understand the root cause and how to prevent this problem in the future.
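For context, a minimal sketch of that recovery path in Databricks SQL, assuming Change Data Feed is enabled on the table (`my_table` and version 42 are placeholders):

```sql
-- Inspect the table history to find the last valid version after VACUUM.
DESCRIBE HISTORY my_table;

-- Read the change feed starting from that version (e.g. version 42).
SELECT * FROM table_changes('my_table', 42);
```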
Thank you for your guidance!
3 weeks ago
This behavior is not a bug but rather an inherent aspect of how the VACUUM operation functions. VACUUM does not delete anything from the `_delta_log` folder; that folder has its own retention, which defaults to 30 days.
It is up to you to decide how much time travel or versioning you want for your data. Data files take up storage space, so keeping the default 7-day retention is a good balance: anything longer incurs additional storage cost, while anything shorter increases the risk of losing versions you still need. Matching the 30-day retention of `_delta_log` is also reasonable, but weigh the cost against your use case.
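For reference, a sketch of how those two retention windows could be set explicitly as Delta table properties (`my_table` is a placeholder name):

```sql
-- Adjust retention explicitly if the defaults don't fit your use case.
ALTER TABLE my_table SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 30 days',         -- how long _delta_log history is kept
  'delta.deletedFileRetentionDuration' = 'interval 7 days'   -- minimum age before VACUUM may remove data files
);
```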
3 weeks ago
The error you're encountering, `com.databricks.sql.transaction.tahoe.DeltaFileNotFoundException: [DELTA_EMPTY_DIRECTORY] No file found in the directory: gs://cimb-prod-lakehouse/bronze-layer/losdb/pl_message/_delta_log`, indicates that the `_delta_log` directory is empty or missing, which is critical for Delta Lake operations. This issue can arise from improper use of the VACUUM operation.
```sql
VACUUM my_table RETAIN 168 HOURS; -- Retain files for 7 days
```
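If in doubt, a DRY RUN can preview what would be removed before deleting anything (sketch with the same placeholder table name):

```sql
-- List the files VACUUM would delete, without actually removing them.
VACUUM my_table RETAIN 168 HOURS DRY RUN;
```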
3 weeks ago
Hi @saurabh18cs ,
Thank you for your explanation regarding the VACUUM operation and the error I encountered. I appreciate your insights.
I would like to clarify further: why does the VACUUM feature sometimes delete files that are still necessary and being referenced? Is this behavior considered a bug, or is it an inherent aspect of how the VACUUM operation functions? Understanding this will help me better manage the retention period and prevent future issues.
Hi @VZLA , I would appreciate it if you could let me know your thoughts on this matter.
Thank you for your assistance!
3 weeks ago
Hi @minhhung0507 ,
You must choose a retention interval that is longer than both the longest-running concurrent transaction and the longest period that any stream can lag behind the most recent update to the table. Otherwise, the table can become corrupted when VACUUM deletes files that have not yet been committed or are still being read.
There is also a safety check that verifies no operations on the table take longer than the retention interval you plan to specify. You can turn this safety check off or on by setting the Spark configuration property `spark.databricks.delta.retentionDurationCheck.enabled` to false or true.
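As an illustration, a minimal sketch (placeholder table name) of temporarily disabling that safety check to vacuum with a shorter retention, which should only be done if you are certain nothing reads or writes the table over a longer window:

```sql
-- Only disable the check if no concurrent reader/writer runs longer than the retention.
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM my_table RETAIN 48 HOURS;

-- Re-enable the safety check afterwards.
SET spark.databricks.delta.retentionDurationCheck.enabled = true;
```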
Hope this helps!!!
3 weeks ago
Hi @saurabh18cs ,
Thanks for that very detailed explanation. I will take note and continue to observe this case.
3 weeks ago
Hi @minhhung0507,
The VACUUM command on a Delta table does not delete the `_delta_log` folder, as this folder contains all the metadata related to the Delta table. The `_delta_log` folder acts as a pointer where all changes are tracked. In the event that the `_delta_log` folder is accidentally deleted, it cannot be recovered unless bucket versioning is enabled. If versioning is enabled, you can restore the deleted files and run the FSCK REPAIR command to fix the Delta table. However, it's important to understand how Delta performs the FSCK operation under the hood.
For more details on VACUUM, refer to the following link: VACUUM | Databricks on Google Cloud
If you are still facing issues querying the table because of missing Parquet files, you can fix it by running the following command; refer to the following link: FSCK REPAIR TABLE | Databricks on Google Cloud
```sql
FSCK REPAIR TABLE table_name [DRY RUN]
```
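For example, a dry run first to preview which missing file references would be dropped, then the actual repair (`my_table` is a placeholder):

```sql
-- Preview the file entries that would be removed from the transaction log.
FSCK REPAIR TABLE my_table DRY RUN;

-- Remove the dangling file references so the table can be queried again.
FSCK REPAIR TABLE my_table;
```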
Regards,
Hari Prasad