08-12-2022 09:42 AM
AWS, by the way, if that matters. We have an old production table that has been running in the background for a couple of years, always with auto-optimize and auto-compaction turned off. Over that time it has written many small files (roughly 10,000 per hour at ~100 KB each).
Also, the _delta_log folder has grown to ~170 TB, mostly files named `***.checkpoint.***.parquet` that it has been creating every time we write to the table (hourly batch job, checkpoints are ~50 GB an hour). So basically I want to get rid of that 170 TB, and ideally also clean up/optimize the many small files. (We don't really care about time travel for this table.)
I've tried setting `delta.logRetentionDuration` to 30 days and running `VACUUM`, which cleaned up a little, but there are still a bunch of checkpoints in the _delta_log folder that are up to 6 months old, and I'd like to get rid of (almost) all of them.
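Roughly what I ran, sketched here against a placeholder table name `x`:

```sql
-- Sketch of what was tried; 'x' is a placeholder table name.
-- Shorten how long Delta log entries are retained.
ALTER TABLE x SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 30 days');

-- Remove data files that are no longer referenced by the table.
VACUUM x;
```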
I've tried running `OPTIMIZE x WHERE utc_date = '2020-01-01'` (data is partitioned on utc_date) on a single day of data, but it took half an hour, which, extrapolated to the entire table, would take weeks...
I've also tried copying a single day of data to a new table, which was much faster (1-2 minutes per day); that led me to think a good option may be to just copy all the data to a new table.
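The one-day test copy looked roughly like this (a sketch; `x_new` is a hypothetical name for the new table):

```sql
-- Sketch of the single-day test copy; table names are placeholders.
-- Create the new Delta table from one day of data, keeping the same partitioning.
CREATE TABLE x_new
USING DELTA
PARTITIONED BY (utc_date)
AS SELECT * FROM x WHERE utc_date = '2020-01-01';
```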
So to summarize, what do you think is the best/most efficient way to fix this table, and to deal with such a blown-up delta log?
08-12-2022 11:57 AM
I would go with the copy, as a quick solution: set up a job that copies one day, then runs OPTIMIZE on that partition, and then triggers the next run with the parameter date + 1 day. Going forward, you can set up a nightly job that runs OPTIMIZE and VACUUM.
At the Data + AI Summit it was mentioned that it will become possible to create checkpoints less often than every 10 commits, but I don't see that option yet.
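One run of such a job could look roughly like this in SQL (a sketch; the table names are placeholders and '2020-01-01' stands in for the job's date parameter):

```sql
-- Sketch of one run of the day-by-day copy job; the date is the job parameter.
-- Copy one day's partition from the old table into the new one.
INSERT INTO x_new SELECT * FROM x WHERE utc_date = '2020-01-01';

-- Compact the freshly copied partition into fewer, larger files.
OPTIMIZE x_new WHERE utc_date = '2020-01-01';

-- Nightly maintenance going forward (separate scheduled job):
-- OPTIMIZE x_new;
-- VACUUM x_new;
```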
09-11-2022 12:08 AM
Hey there @Fenno Vermeij
Hope all is well! Just wanted to check in to see if you were able to resolve your issue; if so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!
09-22-2022 02:01 AM
Thanks! We did indeed solve it by copying the data to a new table!
The reason the VACUUM didn't initially work was that we had not set `delta.logRetentionDuration`; after setting that to 14 days and running VACUUM, the old logs were cleaned up.
On the new table we enabled auto-optimize and auto-compaction, and everything is looking good (although the copy did take a few days since we had MANY files).
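For reference, enabling those settings on the new table looks roughly like this (a sketch; `x_new` is a placeholder table name):

```sql
-- Sketch of the auto-optimize settings enabled on the new table; 'x_new' is a placeholder.
ALTER TABLE x_new SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',  -- write fewer, larger files
  'delta.autoOptimize.autoCompact'   = 'true'   -- compact small files after writes
);
```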
07-12-2024 05:16 AM
Sometimes, if a Delta table has only a few commit versions, it won't create checkpoint files. The checkpoint file is what triggers the log cleanup activity. If you observe that there are no checkpoint files for the recent versions, you can use `delta.checkpointInterval` to reduce the number of commits after which a checkpoint is created. Once a new checkpoint is written, the log cleanup activity will be triggered.
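For example, setting the property looks roughly like this (a sketch; the table name and the value 10 are placeholders, so pick a value that fits your commit rate):

```sql
-- Sketch: write a checkpoint every 10 commits; table name and value are placeholders.
ALTER TABLE my_table SET TBLPROPERTIES ('delta.checkpointInterval' = '10');
```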