Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Bug Delta Live Tables - Checkpoint

antoniomf
New Contributor

Hello, I've encountered an issue with Delta Live Tables in both my Development and Production workspaces. The data is arriving correctly in my Azure Storage Account; however, the checkpoint is being stored under the dbfs:/ path. I haven't modified the Storage Location, and the data is being written to the tables correctly. The problem is that the pipeline performs a full refresh because the checkpoint starts from scratch. Is this a bug in Databricks?

1 REPLY

mark_ott
Databricks Employee

There appears to be a recurring issue with Delta Live Tables (DLT) pipelines in Databricks where the checkpoint is unexpectedly stored under the dbfs:/ path rather than in the intended external storage location (such as Azure Blob Storage or ADLS). This causes the DLT pipeline to perform a full refresh, because the checkpoint is not tracked consistently and all tables and incremental processing are reset.

Nature of the Issue

Delta Live Tables stores checkpoint information under a default dbfs:/ location within the Databricks file system if no explicit external location is configured. If your pipeline writes data to an external Azure Storage Account but the checkpoints remain in dbfs:/, Databricks may treat every run as a full refresh rather than an incremental update. This happens because the system cannot locate previous checkpoint data in the expected external storage context, so it processes all data from scratch on each run.
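To make the relationship concrete, here is a minimal sketch of a DLT streaming table in Python reading from a hypothetical ADLS landing path (the dlt module and the spark session are provided by the DLT runtime; the storage account, container, and path are placeholders). Nothing in the table definition itself decides where checkpoints go; that is governed by the pipeline-level storage configuration, which is why an unset storage location silently falls back to dbfs:/.

# Minimal sketch of a DLT streaming table, meant to run inside a DLT pipeline.
# The storage account, container, and path below are hypothetical placeholders.
import dlt

@dlt.table(
    name="raw_events",
    comment="Streaming ingest from an ADLS landing zone via Auto Loader.",
)
def raw_events():
    # This definition controls what is read and written, not where the pipeline
    # keeps its checkpoints; checkpoint placement comes from the pipeline's
    # storage location setting, not from this code.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("abfss://landing@examplestorageaccount.dfs.core.windows.net/events/")
    )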

Is This a Databricks Bug?

This issue has been reported by multiple users on the Databricks Community forums and Stack Overflow, which suggests it is either unexpected configuration behavior or a product bug, especially if you did not modify the default storage location and expected correct checkpointing out of the box. There is no formal acknowledgment from Databricks of a platform-wide bug, but the widespread advice is to explicitly set the checkpoint location in your DLT pipeline configuration to prevent the problem.

Recommended Fixes and Best Practices

  • Set the Checkpoint Location Explicitly: Navigate to your DLT pipeline settings in the Databricks workspace, open "Advanced settings," and set the "Checkpoint location" to your external storage path (e.g., Azure Blob Storage or ADLS); see the configuration check sketch after this list.

  • Restart the Workflow: If the problem persists, stop the pipeline execution, delete the existing checkpoint files in the external storage, clear the pipeline cache, and restart the execution to reinitialize checkpointing.

  • Avoid Full Refreshes Unless Necessary: Running a full refresh on streaming tables resets all streaming state and stored checkpoints, potentially causing data loss if the input data is no longer available.

  • Follow Best Practices: For scalability and reliability, always configure checkpoint storage outside of dbfs:/ when working on production workloads with external data sources.
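As a hedged illustration of the first bullet, the sketch below uses the Databricks Pipelines REST API (GET /api/2.0/pipelines/{pipeline_id}) to check where a pipeline roots its storage, which is where checkpoints and other metadata end up. The environment variable names and the "storage"/"catalog"/"target" field names are assumptions to verify against your workspace.

# Hedged sketch: inspect a DLT pipeline's settings through the Databricks
# Pipelines REST API to confirm where the pipeline roots its storage (and
# therefore its checkpoints and other metadata).
import os
import requests

host = os.environ["DATABRICKS_HOST"]        # e.g. https://adb-1234567890123456.7.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]      # personal access token
pipeline_id = os.environ["PIPELINE_ID"]     # ID of the DLT pipeline to inspect

resp = requests.get(
    f"{host}/api/2.0/pipelines/{pipeline_id}",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)
resp.raise_for_status()
spec = resp.json().get("spec", {})

# If "storage" is unset or points at dbfs:/, pipeline metadata (including
# checkpoints) falls back to the Databricks-managed DBFS root.
print("storage:", spec.get("storage", "<not set - defaults under dbfs:/>"))
print("catalog:", spec.get("catalog", "<not set>"))
print("target :", spec.get("target", "<not set>"))

Running this against an affected pipeline should make it clear whether it was created without an explicit storage location, which would explain the checkpoints landing under dbfs:/.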

Next Steps

  • Verify your pipeline's configuration settings and set the explicit checkpoint location.

  • If issues persist even after corrective configuration, consider reaching out to Databricks support with detailed logs and pipeline configuration for deeper troubleshooting.

  • Monitor Databricks release notes or community discussions for updates on this potential bug.

Handled proactively, these configuration adjustments greatly reduce the chance of unwanted full refreshes or loss of incremental processing in DLT pipelines.