There appears to be a recurring issue with Delta Live Tables (DLT) pipelines in Databricks where the checkpoint is unexpectedly stored under the dbfs:/ path rather than in the intended external storage location (such as Azure Blob Storage or ADLS). Because the checkpoint is not tracked consistently, the DLT pipeline performs a full refresh on each run, resetting all tables and discarding incremental processing.
Nature of the Issue
Delta Live Tables stores checkpoint information in dbfs:/delta/ within the Databricks file system by default if no explicit external location is configured. If your pipeline writes data to an external Azure Storage Account but checkpoints remain in dbfs:/, Databricks may treat every run as a full refresh rather than an incremental update: the system cannot locate previous checkpoint data in the proper external storage context, so it processes all data from scratch on each run.
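To see the underlying mechanic outside of DLT's managed abstraction, the plain Structured Streaming sketch below shows how the checkpointLocation option determines where streaming progress is tracked; the storage account, container, and path names are placeholders. If that option is omitted or points at DBFS while the data lives in ADLS, the progress tracking and the data end up in different places, which is the mismatch described above.

```python
# Minimal Structured Streaming sketch (not DLT-managed); all paths and account names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Incrementally ingest files from an external ADLS container with Auto Loader.
events = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("abfss://raw@mystorageaccount.dfs.core.windows.net/events/")
)

# The checkpointLocation option is what lets the next run resume where this one stopped.
# If it is left to a DBFS-backed default, checkpoint state and output data live in
# different storage contexts, and incremental progress cannot be recovered reliably.
query = (
    events.writeStream.format("delta")
    .option(
        "checkpointLocation",
        "abfss://checkpoints@mystorageaccount.dfs.core.windows.net/events/",
    )
    .start("abfss://curated@mystorageaccount.dfs.core.windows.net/events/")
)
```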
Is This a Databricks Bug?
This issue has been reported by multiple users in Databricks community forums and on Stack Overflow, suggesting it is either an unexpected configuration behavior or a product bug, especially if you did not modify the default storage location and expected correct checkpointing out of the box. There is no formal acknowledgment from Databricks of a platform-wide bug, but the widespread advice is to explicitly set the checkpoint location in your DLT pipeline configuration to prevent the problem.
Recommended Fixes and Best Practices
- Set the Checkpoint Location Explicitly: Navigate to your DLT pipeline settings in the Databricks workspace, find "Advanced settings," and set the "Checkpoint location" to your external storage path (e.g., Azure Blob Storage or ADLS); a REST API sketch of this change follows this list.
- Restart the Workflow: If malfunctions persist, stop pipeline execution, delete the existing checkpoint files in external storage, clear the pipeline cache, and restart execution to reinitialize checkpointing (a cleanup sketch also follows this list).
- Avoid Full Refreshes Unless Necessary: Running a full refresh on streaming tables resets all processing state and stored checkpoints, potentially causing data loss if the input data is no longer available.
- Follow Best Practices: For scalability and reliability, always configure checkpoint storage outside of dbfs:/ when working on production workloads with external data sources.
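As a rough illustration of the first fix, the sketch below updates a pipeline's storage location through the Databricks Pipelines REST API instead of the UI. The workspace URL, token, and pipeline ID are placeholders, and it assumes the pipeline spec's storage field is what points output and metadata (including checkpoints) at external storage; since the PUT call replaces the whole spec, the current spec is fetched and modified first.

```python
# Sketch only: placeholders throughout; the "storage" field is assumed to control
# where pipeline output and metadata (including checkpoints) are written.
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "dapiXXXXXXXXXXXXXXXX"                                        # placeholder PAT
PIPELINE_ID = "your-pipeline-id"                                      # placeholder
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Fetch the current spec so only the storage location is changed.
spec = requests.get(
    f"{WORKSPACE_URL}/api/2.0/pipelines/{PIPELINE_ID}", headers=HEADERS
).json()["spec"]

# Point pipeline output and metadata at external ADLS storage (placeholder path).
spec["storage"] = "abfss://dlt@mystorageaccount.dfs.core.windows.net/pipelines/my_pipeline"

# PUT replaces the entire pipeline spec with the edited copy.
requests.put(
    f"{WORKSPACE_URL}/api/2.0/pipelines/{PIPELINE_ID}", headers=HEADERS, json=spec
).raise_for_status()
```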
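For the restart step, a hypothetical cleanup sketch is shown below; the checkpoint path is a placeholder and dbutils is the utility object available inside Databricks notebooks. Keep in mind that deleting checkpoints forces reprocessing from the beginning, so only do this when you intend a clean reinitialization.

```python
# Hypothetical cleanup before a clean restart; the checkpoint path is a placeholder.
checkpoint_root = "abfss://checkpoints@mystorageaccount.dfs.core.windows.net/dlt/my_pipeline/"

# dbutils is available in Databricks notebooks; recurse=True removes the directory tree.
# After this, the next run reprocesses all inputs from the start.
dbutils.fs.rm(checkpoint_root, recurse=True)
```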
Next Steps
- Verify your pipeline's configuration settings and set the explicit checkpoint location (a verification sketch follows these steps).
- If issues persist even after corrective configuration, consider reaching out to Databricks support with detailed logs and pipeline configuration for deeper troubleshooting.
- Monitor Databricks release notes or community discussions for updates on this potential bug.
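A quick way to carry out the first verification step from a notebook is to list both the DBFS default area and the intended external path and see where checkpoint and metadata directories are actually accumulating. Both paths below are placeholders based on the defaults mentioned earlier.

```python
# Verification sketch: check where checkpoint/metadata directories actually live.
# Both paths are placeholders; adjust them to your workspace and storage account.
candidate_paths = [
    "dbfs:/delta/",                                                    # DBFS default mentioned above
    "abfss://checkpoints@mystorageaccount.dfs.core.windows.net/dlt/",  # intended external location
]

for path in candidate_paths:
    try:
        entries = dbutils.fs.ls(path)
        print(path, "->", [entry.name for entry in entries])
    except Exception as err:  # the path may simply not exist yet
        print(path, "->", f"not accessible ({err})")
```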
Handled proactively, these configuration adjustments greatly reduce the chance of encountering unwanted full refreshes or losing incremental processing in DLT pipelines.