Ashwin_DSA
Databricks Employee

Hi @MikeGo,

The two properties together (delta.logRetentionDuration and delta.deletedFileRetentionDuration) do what you want, but they only set the eligibility threshold for cleanup; they don't delete anything themselves. Old log files are pruned when the next checkpoint is written (every 10 commits by default), and old data files are only removed when you actually run VACUUM. So if you set the properties while your raw table is still receiving commits from Kafka, the next checkpoint will trigger the log pruning, and your trigger initialisation should succeed shortly after. If the table is idle, you may need to force a small commit or run VACUUM to trigger the cleanup.
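As a rough illustration of setting both properties in one statement, here is a minimal sketch that only builds the SQL string. The table name main.bronze.raw_events and the one-hour log retention value are assumptions for the example, not taken from your setup.

```python
def set_retention_sql(table: str, log_retention: str, file_retention: str) -> str:
    """Return an ALTER TABLE statement that sets both Delta retention properties."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'delta.logRetentionDuration' = '{log_retention}', "
        f"'delta.deletedFileRetentionDuration' = '{file_retention}')"
    )


# Hypothetical table name and interval values, for illustration only.
stmt = set_retention_sql("main.bronze.raw_events", "interval 1 hours", "interval 1 hours")
print(stmt)
# On a cluster you would then run the statement with: spark.sql(stmt)
```

Running this statement only changes the thresholds; the pruning itself still waits for the next checkpoint and for VACUUM.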

The bigger thing to be careful about is that delta.deletedFileRetentionDuration = 1 hour is genuinely aggressive. Anything that reads the table from state pointing to a version more than an hour old will break. Concretely, that includes any Structured Streaming consumer of the raw table whose checkpoint is lagging by more than an hour, any time travel queries (VERSION AS OF, TIMESTAMP AS OF), any CDF reads against older versions, and any other downstream trigger that relies on commit history. If your downstream is fully serverless and lightweight, this is probably fine, but in a busy environment with multiple consumers it can cause cascading failures. Resetting the properties afterwards is the right move, and I would add one VACUUM run before resetting, so the cleanup actually completes while the short retention window is still in effect.
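The ordering described above (VACUUM first, reset second) could be sketched like this. It assumes the table used the default retention values before the change, so UNSET is enough to restore them; if you had explicit values, you would SET them back instead. The table name is hypothetical.

```python
RETENTION_KEYS = (
    "delta.logRetentionDuration",
    "delta.deletedFileRetentionDuration",
)


def cleanup_then_reset_sql(table: str) -> list[str]:
    """Return the statements in the order they should run:
    VACUUM while the short retention is still set, then drop the overrides."""
    keys = ", ".join(f"'{key}'" for key in RETENTION_KEYS)
    return [
        f"VACUUM {table}",
        f"ALTER TABLE {table} UNSET TBLPROPERTIES IF EXISTS ({keys})",
    ]


# Hypothetical table name, for illustration only.
statements = cleanup_then_reset_sql("main.bronze.raw_events")
for statement in statements:
    print(statement)
    # On a cluster: spark.sql(statement)
```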

For ongoing hygiene rather than a one-shot fix, the more sustainable approach is to leave delta.logRetentionDuration at something modest, like one to seven days, and ensure checkpoints are being written regularly. The micro-batch interval of the Kafka-to-raw streaming job is the lever there. Larger, less frequent batches reduce log churn at the source and keep the trigger initialisation fast without you having to flip table properties on and off. See the "Delta table properties" and "Work with Delta Lake table history" documentation pages for the full set of related settings.
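To get a feel for how much the micro-batch interval matters, here is a back-of-the-envelope sketch. It assumes every micro-batch produces exactly one commit and that a checkpoint is written every 10 commits; real numbers will differ if batches are empty or the checkpoint interval has been changed.

```python
SECONDS_PER_DAY = 24 * 60 * 60
CHECKPOINT_INTERVAL = 10  # assumed default: one checkpoint per 10 commits


def daily_log_churn(trigger_seconds: int) -> tuple[int, int]:
    """Estimate (commits per day, checkpoints per day) for a given trigger interval."""
    commits = SECONDS_PER_DAY // trigger_seconds
    checkpoints = commits // CHECKPOINT_INTERVAL
    return commits, checkpoints


for seconds in (10, 60, 300):
    commits, checkpoints = daily_log_churn(seconds)
    print(f"trigger every {seconds}s: {commits} commits/day, {checkpoints} checkpoints/day")
```

Under these assumptions, moving from a 10-second to a 5-minute trigger takes the raw table from 8640 commits a day to 288. In PySpark the interval is set on the stream writer, for example with .trigger(processingTime="5 minutes").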

On the staging table idea you abandoned, you were right to be uncomfortable about the two-write transaction problem. It's worth noting that you can sidestep it entirely by reading the raw table's commits with Delta change data feed into the staging table as a streaming job, so the staging table is derived from raw rather than written in parallel with it. Atomicity stops mattering because there's only one source of truth. That said, if the property toggle is working, sticking with it is simpler.
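If you ever revisit the staging idea, the derived-from-raw pattern might look roughly like the sketch below. It assumes change data feed is already enabled on the raw table (delta.enableChangeDataFeed = true) and that spark is an active SparkSession; the function name, table names, and checkpoint path are placeholders.

```python
def start_staging_stream(spark, raw_table: str, staging_table: str, checkpoint_path: str):
    """Stream the raw table's change feed into a staging table.

    The staging table is derived from raw, so there is no second
    parallel write that needs to be kept atomic with the first.
    """
    # Read row-level changes; the feed adds columns such as
    # _change_type, _commit_version and _commit_timestamp.
    changes = (
        spark.readStream
        .option("readChangeFeed", "true")
        .table(raw_table)
    )
    # toTable starts the query and returns the StreamingQuery handle.
    return (
        changes.writeStream
        .option("checkpointLocation", checkpoint_path)
        .toTable(staging_table)
    )


# Example call with placeholder names (requires a running Spark session):
# query = start_staging_stream(spark, "main.bronze.raw_events",
#                              "main.bronze.raw_events_staging",
#                              "/tmp/checkpoints/raw_events_staging")
```

Keep in mind that this consumer is itself a CDF reader, so it is subject to the same retention caveat as the other downstream readers if it falls behind.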

Hope this helps.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***