Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

"desc history" shows versions older than the default logRetentionDuration of 30 days

LavaLiah_85929
New Contributor II

I have a CDC-enabled table on which no data changes were made after July 28. Updates resumed on November 22, and the first checkpoint occurred on Nov 28. Based on the timestamps of the checkpoint and log files, there appear to be 20 log files between July 28 and Nov 28, when the checkpoint occurred.

  1. The checkpoint is somehow associated with the July 28 log number which does not look right.
  2. Why did the checkpoint not occur after 10 transactions?

Both "desc history" and the CDC table_changes function still show the July 28 version, which does not match the default logRetentionDuration of 30 days.

Please see the attached file containing a few snippets.

Can someone comment on the behavior we are noticing?
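For anyone making the same comparison between checkpoint and log files, here is a minimal Python sketch that classifies file names from a `_delta_log` directory listing. It assumes the usual Delta naming scheme (a 20-digit zero-padded version followed by `.json` for commits, and `.checkpoint.parquet` or a multi-part variant for checkpoints). The function names `classify` and `summarize_delta_log` are hypothetical helpers, not part of any Delta API.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple

# Commit files: 20-digit zero-padded version + ".json".
_COMMIT = re.compile(r"^(\d{20})\.json$")
# Checkpoints: single-file or multi-part (".checkpoint.<part>.<total>.parquet").
_CHECKPOINT = re.compile(r"^(\d{20})\.checkpoint(\.\d+\.\d+)?\.parquet$")


def classify(name: str) -> Optional[Tuple[int, str]]:
    """Return (version, kind) for a log file name, or None if it is neither."""
    match = _COMMIT.match(name)
    if match:
        return int(match.group(1)), "commit"
    match = _CHECKPOINT.match(name)
    if match:
        return int(match.group(1)), "checkpoint"
    return None  # e.g. _last_checkpoint or .crc files


def summarize_delta_log(names: Iterable[str]) -> Dict[str, object]:
    """Summarize which commit versions and checkpoint versions are present."""
    commits: List[int] = []
    checkpoints = set()
    for name in names:
        entry = classify(name)
        if entry is None:
            continue
        version, kind = entry
        if kind == "commit":
            commits.append(version)
        else:
            checkpoints.add(version)
    return {
        "earliest_commit": min(commits) if commits else None,
        "latest_commit": max(commits) if commits else None,
        "checkpoints": sorted(checkpoints),
    }
```

Feeding it the file names from the table's `_delta_log` folder shows the earliest commit version still on disk, which can then be compared with the first version that "desc history" reports.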

2 REPLIES

shyam_9
Databricks Employee

Hi @Laval Liahkim, could you please try running VACUUM with a 30-day retention?

Please confirm when you last ran the command with the 30-day retention period. Also, when did you create this table, and do you see that old data files were deleted?

Also, when disk caching is enabled, a cluster might still hold data from Parquet files that have been deleted by the VACUUM command. As a result, it may be possible to query data from previous table versions whose files have been deleted. Restarting the cluster will remove the cached data.
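As a rough illustration of the VACUUM suggestion above, the sketch below builds the SQL statement for a retention period given in days. `VACUUM ... RETAIN n HOURS` takes hours, so 30 days becomes 720 hours. The helper name `vacuum_statement` and the table name in the usage note are hypothetical.

```python
def vacuum_statement(table_name: str, retention_days: int, dry_run: bool = False) -> str:
    """Build a Delta VACUUM statement for a retention period expressed in days."""
    if retention_days < 0:
        raise ValueError("retention_days must be non-negative")
    hours = retention_days * 24  # RETAIN expects hours
    statement = f"VACUUM {table_name} RETAIN {hours} HOURS"
    if dry_run:
        # DRY RUN lists the files that would be deleted without deleting them.
        statement += " DRY RUN"
    return statement
```

In a notebook with an active Spark session this could be run as `spark.sql(vacuum_statement("my_db.my_table", 30))`, where `my_db.my_table` stands in for the real table. Running with `dry_run=True` first is a low-risk way to see what would be removed.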

When I posted the question, I had already run VACUUM on the table with 30 days retention. I just checked again and the issue seems to be resolved. The first JSON checkpoint file is now dated Dec 12, and it matches the first version in desc history. I'm guessing the checkpointing process adds some buffer to the x-day retention period.
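The "buffer" guess can be illustrated with a simplified model. As I understand the open-source Delta Lake behavior, log cleanup runs only when a new checkpoint is written, and the cutoff (now minus the retention period) is truncated to the start of the day, so entries can outlive the nominal 30 days. The Python sketch below is an assumption-laden approximation of that rule, not the actual implementation; `log_cutoff` and `eligible_for_cleanup` are hypothetical names.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def log_cutoff(now: datetime, retention_days: int = 30) -> datetime:
    """Cutoff = (now - retention), truncated to midnight (assumed behavior)."""
    raw = now - timedelta(days=retention_days)
    return raw.replace(hour=0, minute=0, second=0, microsecond=0)


def eligible_for_cleanup(
    commit_times: Dict[int, datetime],
    latest_checkpoint_version: Optional[int],
    now: datetime,
    retention_days: int = 30,
) -> List[int]:
    """Versions a cleanup pass could drop, under this simplified model.

    A commit is eligible only if it is older than the truncated cutoff AND
    a checkpoint exists at a later version, so the table state can still be
    rebuilt without it. With no checkpoint, nothing is eligible.
    """
    if latest_checkpoint_version is None:
        return []
    cutoff = log_cutoff(now, retention_days)
    return sorted(
        version
        for version, ts in commit_times.items()
        if ts < cutoff and version < latest_checkpoint_version
    )
```

Under this model, a table with no new commits writes no new checkpoint, so old versions can remain visible in the history well past the retention period until activity resumes and a checkpoint triggers cleanup.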
