There appears to be a recurring issue with Delta Live Tables (DLT) pipelines in Databricks where the checkpoint is unexpectedly stored under the dbfs:/ path rather than in the intended external storage location (such as Azure Blob Storage or ADLS). Because the checkpoint is not tracked consistently, the DLT pipeline performs a full refresh on each run, resetting all tables and discarding incremental processing.
Nature of the Issue
Delta Live Tables stores checkpoint information in dbfs:/delta/ within the Databricks file system by default if no explicit external location is configured. If your pipeline writes data to an external Azure Storage Account but checkpoints remain in dbfs:/, Databricks may treat every run as a full refresh rather than an incremental update: the system cannot locate previous checkpoint data in the proper external storage context, so it processes all data from scratch on each run.
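To see the underlying mechanic outside of DLT's managed abstraction, the plain Structured Streaming sketch below shows how the checkpointLocation option determines where streaming progress is tracked; the storage account, container, and path names are placeholders. If that option is omitted or points at DBFS while the data lives in ADLS, the progress tracking and the data end up in different places, which is the mismatch described above.

```python
# Minimal Structured Streaming sketch (not DLT-managed); all paths and account names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Incrementally ingest files from an external ADLS container with Auto Loader.
events = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("abfss://raw@mystorageaccount.dfs.core.windows.net/events/")
)

# The checkpointLocation option is what lets the next run resume where this one stopped.
# If it is left to a DBFS-backed default, checkpoint state and output data live in
# different storage contexts, and incremental progress cannot be recovered reliably.
query = (
    events.writeStream.format("delta")
    .option(
        "checkpointLocation",
        "abfss://checkpoints@mystorageaccount.dfs.core.windows.net/events/",
    )
    .start("abfss://curated@mystorageaccount.dfs.core.windows.net/events/")
)
```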
Is This a Databricks Bug?
This issue has been reported by multiple users in Databricks community forums and on Stack Overflow, suggesting it is either an unexpected configuration behavior or a product bug, especially if you did not modify the default storage location and expected correct checkpointing out of the box. There is no formal acknowledgment from Databricks of a platform-wide bug, but the widespread advice is to explicitly set the checkpoint location in your DLT pipeline configuration to prevent the problem.
Recommended Fixes and Best Practices
- Set the Checkpoint Location Explicitly: Navigate to your DLT pipeline settings in the Databricks workspace, find "Advanced settings," and set the "Checkpoint location" to your external storage path (e.g., Azure Blob Storage or ADLS); a REST API sketch of this change follows this list.
- Restart the Workflow: If malfunctions persist, stop pipeline execution, delete the existing checkpoint files in external storage, clear the pipeline cache, and restart execution to reinitialize checkpointing (a cleanup sketch also follows this list).
- Avoid Full Refreshes Unless Necessary: Running a full refresh on streaming tables resets all processing state and stored checkpoints, potentially causing data loss if the input data is no longer available.
- Follow Best Practices: For scalability and reliability, always configure checkpoint storage outside of dbfs:/ when working on production workloads with external data sources.
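As a rough illustration of the first fix, the sketch below updates a pipeline's storage location through the Databricks Pipelines REST API instead of the UI. The workspace URL, token, and pipeline ID are placeholders, and it assumes the pipeline spec's storage field is what points output and metadata (including checkpoints) at external storage; since the PUT call replaces the whole spec, the current spec is fetched and modified first.

```python
# Sketch only: placeholders throughout; the "storage" field is assumed to control
# where pipeline output and metadata (including checkpoints) are written.
import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "dapiXXXXXXXXXXXXXXXX"                                        # placeholder PAT
PIPELINE_ID = "your-pipeline-id"                                      # placeholder
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Fetch the current spec so only the storage location is changed.
spec = requests.get(
    f"{WORKSPACE_URL}/api/2.0/pipelines/{PIPELINE_ID}", headers=HEADERS
).json()["spec"]

# Point pipeline output and metadata at external ADLS storage (placeholder path).
spec["storage"] = "abfss://dlt@mystorageaccount.dfs.core.windows.net/pipelines/my_pipeline"

# PUT replaces the entire pipeline spec with the edited copy.
requests.put(
    f"{WORKSPACE_URL}/api/2.0/pipelines/{PIPELINE_ID}", headers=HEADERS, json=spec
).raise_for_status()
```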
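For the restart step, a hypothetical cleanup sketch is shown below; the checkpoint path is a placeholder and dbutils is the utility object available inside Databricks notebooks. Keep in mind that deleting checkpoints forces reprocessing from the beginning, so only do this when you intend a clean reinitialization.

```python
# Hypothetical cleanup before a clean restart; the checkpoint path is a placeholder.
checkpoint_root = "abfss://checkpoints@mystorageaccount.dfs.core.windows.net/dlt/my_pipeline/"

# dbutils is available in Databricks notebooks; recurse=True removes the directory tree.
# After this, the next run reprocesses all inputs from the start.
dbutils.fs.rm(checkpoint_root, recurse=True)
```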
Next Steps
- Verify your pipeline's configuration settings and set the explicit checkpoint location (a verification sketch follows these steps).
- If issues persist even after corrective configuration, consider reaching out to Databricks support with detailed logs and pipeline configuration for deeper troubleshooting.
- Monitor Databricks release notes or community discussions for updates on this potential bug.
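A quick way to carry out the first verification step from a notebook is to list both the DBFS default area and the intended external path and see where checkpoint and metadata directories are actually accumulating. Both paths below are placeholders based on the defaults mentioned earlier.

```python
# Verification sketch: check where checkpoint/metadata directories actually live.
# Both paths are placeholders; adjust them to your workspace and storage account.
candidate_paths = [
    "dbfs:/delta/",                                                    # DBFS default mentioned above
    "abfss://checkpoints@mystorageaccount.dfs.core.windows.net/dlt/",  # intended external location
]

for path in candidate_paths:
    try:
        entries = dbutils.fs.ls(path)
        print(path, "->", [entry.name for entry in entries])
    except Exception as err:  # the path may simply not exist yet
        print(path, "->", f"not accessible ({err})")
```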
Handled proactively, these configuration adjustments greatly reduce the chance of encountering unwanted full refreshes or losing incremental processing in DLT pipelines.