Hi @ha2hi,
As @balajij8 highlighted, Auto Loader keeps file metadata and state in the checkpoint location (backed by RocksDB), so for long-running or high-volume streams the checkpoint state can grow over time. Databricks specifically recommends setting cloudFiles.maxFileAge if you want to prevent file state from growing without bound. One nuance: expired entries first appear as tombstones, so storage usage can temporarily increase before it levels off.
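As a rough sketch of where that option goes, the snippet below builds the reader options as a plain dictionary. The helper name, the paths, and the "90 days" value are illustrative assumptions, not values from your pipeline. Please check the Auto Loader options documentation for your runtime before choosing a value, since a setting that is too aggressive may lead to files being ingested again.

```python
def auto_loader_options(file_format, schema_location, max_file_age="90 days"):
    """Build Auto Loader reader options that bound tracked file state.

    The helper and its defaults are hypothetical; only the option keys
    are Auto Loader settings.
    """
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        # Stop tracking files older than this so checkpoint state can expire.
        "cloudFiles.maxFileAge": max_file_age,
    }


# Hypothetical schema location, for illustration only.
options = auto_loader_options("json", "/tmp/example/_schemas")

# On a Databricks cluster (assumed paths), the options would be applied like this:
# df = (spark.readStream.format("cloudFiles")
#       .options(**options)
#       .load("/tmp/example/landing"))
```

The checkpoint location itself stays the same; this only changes how long file entries are retained in it.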
I would not recommend manually deleting checkpoint files as a form of periodic cleanup. Databricks recommends keeping checkpoints in a location without a lifecycle policy, because if checkpoint files are cleaned up, the stream state can be corrupted. More generally, if you delete the checkpoint directory or switch to a new checkpoint location, the next run starts fresh, with no record of which files were already processed.
If the concern is actually that processed source files are piling up in the landing location, that is a separate problem from checkpoint growth. In that case, you can use cloudFiles.cleanSource with MOVE or DELETE to manage source-file retention after ingestion.
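For the source-file case, a minimal sketch of the options might look like the following. The helper name, the archive path, and the "30 days" retention are assumptions for illustration. As far as I know, cloudFiles.cleanSource.moveDestination and cloudFiles.cleanSource.retentionDuration are the companion options, but availability depends on your Databricks Runtime version, so please verify against the docs.

```python
def clean_source_options(mode, move_destination=None, retention="30 days"):
    """Build Auto Loader options for cleaning up processed source files.

    Hypothetical helper: it only assembles option keys and validates input.
    """
    mode = mode.upper()
    if mode not in {"OFF", "MOVE", "DELETE"}:
        raise ValueError(f"unsupported cleanSource mode: {mode}")

    opts = {"cloudFiles.cleanSource": mode}
    if mode != "OFF":
        # How long a processed file is kept before it becomes a cleanup candidate.
        opts["cloudFiles.cleanSource.retentionDuration"] = retention
    if mode == "MOVE":
        if not move_destination:
            raise ValueError("MOVE requires a destination path")
        opts["cloudFiles.cleanSource.moveDestination"] = move_destination
    return opts


# Hypothetical archive path, for illustration only.
move_opts = clean_source_options("move", "/tmp/example/archive")
delete_opts = clean_source_options("DELETE")
```

MOVE is the safer starting point, since files remain recoverable from the archive location if something needs to be replayed.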
This is a good page to refer to.
If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.
Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***