Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

[Auto Loader] Inquiry regarding Checkpoint files

ha2hi
Visitor

Hi,

I am currently using Auto Loader to load files stored in the cloud into Databricks tables. I understand that checkpoint files are continuously generated during this process.

I have a couple of questions regarding these files:

  • Do these checkpoint files continue to accumulate indefinitely over time?

  • Is there a way to compress or delete them periodically?

I look forward to hearing from you. Best regards.

1 ACCEPTED SOLUTION

Accepted Solutions

Ashwin_DSA
Databricks Employee

Hi @ha2hi,

As @balajij8 has highlighted, Auto Loader does keep file metadata/state in the checkpoint location (backed by RocksDB), so for long-running or high-volume streams, the checkpoint state can grow over time. Databricks specifically recommends cloudFiles.maxFileAge if you want to prevent file state from growing without limits. One nuance is that expired entries first appear as tombstones, so storage usage can temporarily increase before it levels off.
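As a minimal sketch of how that option might be wired in, the snippet below builds the Auto Loader options separately from the reader. The helper names, the paths, and the "90 days" default are illustrative assumptions, not values taken from this thread; pick a retention that comfortably exceeds how late files can arrive in your landing location.

```python
from typing import Dict


def auto_loader_options(
    file_format: str,
    schema_location: str,
    max_file_age: str = "90 days",  # illustrative placeholder, not a recommendation
) -> Dict[str, str]:
    """Build Auto Loader options that bound how long file state is retained."""
    return {
        "cloudFiles.format": file_format,
        # Where Auto Loader stores the inferred schema (needed for schema inference).
        "cloudFiles.schemaLocation": schema_location,
        # File entries older than this interval become eligible for expiry
        # from the checkpoint's file state.
        "cloudFiles.maxFileAge": max_file_age,
    }


def read_landing_files(spark, source_path: str, options: Dict[str, str]):
    """Return a streaming DataFrame; `spark` is an existing SparkSession on Databricks."""
    return spark.readStream.format("cloudFiles").options(**options).load(source_path)


# Hypothetical schema location, for illustration only.
opts = auto_loader_options("json", "/Volumes/main/demo/_schemas/events")
```

On a cluster you would then call `read_landing_files(spark, "<landing path>", opts)`. Keeping the options in a plain dictionary makes the retention setting easy to review and change in one place.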

I would not recommend manually deleting checkpoint files for periodic cleanup. Databricks recommends keeping checkpoints in a location without a lifecycle policy because if checkpoint files are cleaned up, the stream state can be corrupted. More generally, if you delete the checkpoint directory or switch to a new checkpoint location, the next run starts fresh.
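If the goal is simply to keep an eye on growth rather than to clean anything up, a read-only size check is a safer alternative. The sketch below assumes the checkpoint directory is reachable through an ordinary filesystem path (for example via a mounted volume); the function name and the example path are hypothetical, and nothing in it modifies the checkpoint.

```python
import os


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all files under `root` without changing anything."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.path.getsize(path)
            except OSError:
                # A running stream may remove a file between listing and stat.
                continue
    return total


# Example (hypothetical path):
# print(directory_size_bytes("/Volumes/main/demo/_checkpoints/events"))
```

Recording this number on a schedule shows whether the state is levelling off after the file-age limit takes effect, without touching the files themselves.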

If the concern is actually that processed source files are piling up in the landing location, that is a separate problem from checkpoint growth. In that case, you can use cloudFiles.cleanSource with MOVE or DELETE to manage source-file retention after ingestion.
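A rough sketch of the source-cleanup options follows. The helper name and the archive path are hypothetical. The `cloudFiles.cleanSource.moveDestination` and `cloudFiles.cleanSource.retentionDuration` keys reflect my understanding of the option names, so please confirm them against the Auto Loader options reference for your runtime before relying on them.

```python
from typing import Dict, Optional

VALID_MODES = {"OFF", "MOVE", "DELETE"}


def clean_source_options(
    mode: str,
    move_destination: Optional[str] = None,
    retention: Optional[str] = None,
) -> Dict[str, str]:
    """Build options that archive or delete source files after ingestion."""
    mode = mode.upper()
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")

    options = {"cloudFiles.cleanSource": mode}
    if mode == "MOVE":
        if not move_destination:
            raise ValueError("MOVE requires a destination path for archived files")
        # Assumed option name: where processed files are moved to.
        options["cloudFiles.cleanSource.moveDestination"] = move_destination
    if retention:
        # Assumed option name: how long to wait after processing before cleanup.
        options["cloudFiles.cleanSource.retentionDuration"] = retention
    return options


# Hypothetical archive location, for illustration only.
archive_opts = clean_source_options("move", "s3://example-bucket/archive/", "30 days")
```

These options would be merged into the same options passed to the `cloudFiles` reader. Note that this only manages files in the landing location; it does not shrink the checkpoint.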

This is a good page to refer to. 

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***


REPLIES

balajij8
Contributor III

Never manually delete or alter files inside a checkpoint directory, as doing so will corrupt the Auto Loader stream.

Auto Loader keeps track of discovered files in the checkpoint location using RocksDB to provide exactly-once ingestion guarantees.

  • For high-volume or long-lived ingestion streams, consider upgrading to Databricks Runtime 17 or above.
  • You can control the size of the checkpoint state with the cloudFiles.maxFileAge option, which expires file events older than a given period. Set it to 30 days if your workload allows.
  • You can use Auto Loader's cleanSource option, which deletes or archives the source files after they are successfully processed.
