
How to clean up extremely large delta log checkpoints and many small files?

FennVerm_60454
New Contributor II

AWS, by the way, if that matters. We have an old production table that has been running in the background for a couple of years, always with auto-optimize and auto-compaction turned off. Over that time it has written many small files (roughly 10,000 per hour at ~100 KB each).

The _delta_log folder has also grown to ~170 TB, mostly files named `***.checkpoint.***.parquet` that get created every time we write to the table (hourly batch job, ~50 GB of checkpoints per hour). So basically I want to get rid of that 170 TB, and ideally also clean up/compact the many small files. (We also don't really care about time travel for this table.)

I've tried setting `delta.logRetentionDuration` to 30 days and running `VACUUM`, which cleaned up a little, but there are still a bunch of checkpoints in the _delta_log folder that are up to 6 months old. I'd still like to get rid of (almost) all of them.
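For reference, a minimal sketch of that attempt, assuming the table is registered as `my_db.my_table` (placeholder name):

```python
# Sketch of the cleanup attempt described above (my_db.my_table is a placeholder).
# `spark` is the SparkSession that Databricks notebooks provide by default.
# delta.logRetentionDuration controls how long _delta_log entries are kept;
# VACUUM removes data files no longer referenced by the table.
spark.sql("""
    ALTER TABLE my_db.my_table
    SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 30 days')
""")
spark.sql("VACUUM my_db.my_table")  # default retention threshold is 7 days
```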

I've tried running `OPTIMIZE x WHERE utc_date = '2020-01-01'` (the data is partitioned on utc_date) on a single day of data, but it took half an hour; extrapolated to the entire table, that would take weeks...

I've also tried copying a single day of data to a new table, which was much faster (1-2 min per day), which led me to think a good option may be to just copy all the data to a new table.
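Roughly what the single-day copy test looked like (table names are placeholders; the new table keeps the same schema and partitioning):

```python
# Single-day copy test (my_db.old_table / my_db.new_table are placeholders).
# Create an empty copy with the same schema, partitioned on utc_date,
# then insert one day of data.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_db.new_table
    USING DELTA
    PARTITIONED BY (utc_date)
    AS SELECT * FROM my_db.old_table WHERE 1 = 0
""")
spark.sql("""
    INSERT INTO my_db.new_table
    SELECT * FROM my_db.old_table
    WHERE utc_date = '2020-01-01'
""")
```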

So to summarize, what do you think is the best/most efficient way to fix this table and deal with such a blown-up delta log?


3 REPLIES

Hubert-Dudek
Esteemed Contributor III

I would go with the copy. For a quick solution: set up a job that copies one day, then runs OPTIMIZE on that partition, and then triggers the next run with the date parameter advanced by one day. Going forward, you can schedule a nightly job that runs OPTIMIZE and VACUUM.
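A minimal sketch of that loop (table names and date range are placeholders; in a real job the date would come from a job parameter rather than a local loop):

```python
# Day-by-day copy + OPTIMIZE sketch (names and dates are placeholders).
from datetime import date, timedelta

src, dst = "my_db.old_table", "my_db.new_table"
day, end = date(2020, 1, 1), date(2022, 1, 1)

while day <= end:
    d = day.isoformat()
    # Copy one partition, then compact it before moving on to the next day.
    spark.sql(f"INSERT INTO {dst} SELECT * FROM {src} WHERE utc_date = '{d}'")
    spark.sql(f"OPTIMIZE {dst} WHERE utc_date = '{d}'")
    day += timedelta(days=1)
```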

At the Data + AI Summit it was mentioned that it will become possible to configure checkpoints less frequently than every 10 commits, but I don't see that option yet.
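On runtimes where the `delta.checkpointInterval` table property is available, a sketch of raising the interval would look like this (table name is a placeholder):

```python
# Sketch: where delta.checkpointInterval is supported, raising it makes
# checkpoints less frequent than the default of every 10 commits
# (my_db.my_table is a placeholder).
spark.sql("""
    ALTER TABLE my_db.my_table
    SET TBLPROPERTIES ('delta.checkpointInterval' = '100')
""")
```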

Vidula
Honored Contributor

Hey there @Fenno Vermeij

Hope all is well! Just wanted to check in to see whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

Thanks! We did indeed solve it by copying the data to a new table!

The reason the VACUUM didn't initially work was that we had not set `delta.logRetentionDuration`; after setting it to 14 days and running VACUUM, the old logs were cleaned up.

On the new table we enabled auto-optimize and auto-compaction, and everything is looking good (although the copy did take a few days since we had MANY files).
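For anyone else doing this, a sketch of the properties we enabled on the new table (table name is a placeholder):

```python
# Enable optimized writes and auto-compaction on the new table
# (my_db.new_table is a placeholder).
spark.sql("""
    ALTER TABLE my_db.new_table
    SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```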
