
How to clean up extremely large delta log checkpoints and many small files?

FennVerm_60454
New Contributor II

AWS, by the way, if that matters. We have an old production table that has been running in the background for a couple of years, always with auto-optimize and auto-compaction turned off. Over that time it has written many small files (roughly 10,000 per hour at ~100 KB each).

The _delta_log folder has also grown to ~170 TB, mostly files named `***.checkpoint.***.parquet` that get created every time we write to the table (hourly batch job, ~50 GB of checkpoints per hour). So basically I want to get rid of that 170 TB, and ideally also clean up/compact the many small files. (We also don't really care about time travel for this table.)

I've tried setting `delta.logRetentionDuration` to 30 days and running `VACUUM`, which cleaned up a little, but there are still a bunch of checkpoints in the _delta_log folder that are up to 6 months old. I'd still like to get rid of (almost) all of them.
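For reference, a minimal sketch of that attempt, assuming the table is registered as `my_db.my_table` (placeholder name):

```python
# Sketch of the cleanup attempt described above (my_db.my_table is a placeholder).
# `spark` is the SparkSession that Databricks notebooks provide by default.
# delta.logRetentionDuration controls how long _delta_log entries are kept;
# VACUUM removes data files no longer referenced by the table.
spark.sql("""
    ALTER TABLE my_db.my_table
    SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 30 days')
""")
spark.sql("VACUUM my_db.my_table")  # default retention threshold is 7 days
```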

I've tried running `OPTIMIZE x WHERE utc_date = '2020-01-01'` (the data is partitioned on utc_date) on a single day of data, but it took half an hour; extrapolated to the entire table, that would take weeks...

I've also tried copying a single day of data to a new table, which was much faster (1-2 min per day), which led me to think a good option may be to just copy all the data to a new table.
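Roughly what the single-day copy test looked like (table names are placeholders; the new table keeps the same schema and partitioning):

```python
# Single-day copy test (my_db.old_table / my_db.new_table are placeholders).
# Create an empty copy with the same schema, partitioned on utc_date,
# then insert one day of data.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_db.new_table
    USING DELTA
    PARTITIONED BY (utc_date)
    AS SELECT * FROM my_db.old_table WHERE 1 = 0
""")
spark.sql("""
    INSERT INTO my_db.new_table
    SELECT * FROM my_db.old_table
    WHERE utc_date = '2020-01-01'
""")
```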

So to summarize, what do you think is the best/most efficient way to fix this table and deal with such a blown-up delta log?


3 REPLIES

Hubert-Dudek
Esteemed Contributor III

I would go with the copy. For a quick solution: set up a job that copies one day, then runs OPTIMIZE on that partition, and then triggers the next run with the date parameter advanced by one day. Going forward, you can schedule a nightly job that runs OPTIMIZE and VACUUM.
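A minimal sketch of that loop (table names and date range are placeholders; in a real job the date would come from a job parameter rather than a local loop):

```python
# Day-by-day copy + OPTIMIZE sketch (names and dates are placeholders).
from datetime import date, timedelta

src, dst = "my_db.old_table", "my_db.new_table"
day, end = date(2020, 1, 1), date(2022, 1, 1)

while day <= end:
    d = day.isoformat()
    # Copy one partition, then compact it before moving on to the next day.
    spark.sql(f"INSERT INTO {dst} SELECT * FROM {src} WHERE utc_date = '{d}'")
    spark.sql(f"OPTIMIZE {dst} WHERE utc_date = '{d}'")
    day += timedelta(days=1)
```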

At the Data + AI Summit it was mentioned that it will become possible to configure checkpoints less frequently than every 10 commits, but I don't see that option yet.
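On runtimes where the `delta.checkpointInterval` table property is available, a sketch of raising the interval would look like this (table name is a placeholder):

```python
# Sketch: where delta.checkpointInterval is supported, raising it makes
# checkpoints less frequent than the default of every 10 commits
# (my_db.my_table is a placeholder).
spark.sql("""
    ALTER TABLE my_db.my_table
    SET TBLPROPERTIES ('delta.checkpointInterval' = '100')
""")
```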

Vidula
Honored Contributor

Hey there @Fenno Vermeij

Hope all is well! Just wanted to check in to see whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

Thanks! We did indeed solve it by copying the data to a new table!

The reason the VACUUM didn't initially work was that we had not set `delta.logRetentionDuration`; after setting it to 14 days and running VACUUM, the old logs were cleaned up.

On the new table we enabled auto-optimize and auto-compaction, and everything is looking good (although the copy did take a few days since we had MANY files).
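For anyone else doing this, a sketch of the properties we enabled on the new table (table name is a placeholder):

```python
# Enable optimized writes and auto-compaction on the new table
# (my_db.new_table is a placeholder).
spark.sql("""
    ALTER TABLE my_db.new_table
    SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```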
