Let’s say we have a big data application where data loss is not an option.
With GZRS (geo-zone-redundant storage) redundancy we would achieve zero data loss as long as the primary region is alive: the writer waits for acknowledgements from two or more Azure availability zones in the primary region (if only one zone is available, the write does not succeed). Once data has been written to the primary region it is copied asynchronously to the secondary region, hence when the primary is down (flood/asteroid/you-name-it outage) there is a possibility of losing data.
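Just to make the setup concrete, here is a rough way to confirm the account is actually provisioned with GZRS via the azure-mgmt-storage SDK (the subscription, resource group and account names below are placeholders):

```python
# Sketch: check the storage account's replication SKU; all names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

subscription_id = "00000000-0000-0000-0000-000000000000"  # hypothetical
client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

account = client.storage_accounts.get_properties("my-resource-group", "mylakestorage")  # hypothetical names
print(account.sku.name)  # e.g. 'Standard_GZRS', or 'Standard_RAGZRS' if read access on the secondary is enabled
```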
Microsoft states that data is copied asynchronously, but it seems the order of files is not guaranteed to be preserved. That means if your primary region is down and you fail over to the secondary, there is a high probability that some or most of your Delta tables will be inconsistent. Imagine the delta-log files were successfully copied to the secondary but some parquet files are missing – the table is inconsistent or out of sync, and reads will fail with errors. Or your parquet files are copied but the delta-log is not there yet – congratulations, you have lost some data and you will not even notice it, because reads will still succeed (i.e. silent data loss).
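To illustrate what “inconsistent” means in practice, here is a rough probe (my own sketch, nothing official) one could run against the secondary after failover: take the data files referenced by `add` actions in the `_delta_log` and check whether they actually exist in storage. The table path is hypothetical, and the probe ignores `remove` actions, so it can over-report on tables with deletes or compaction – it is only meant to show the failure mode:

```python
from pyspark.sql.functions import col

table_path = "abfss://lake@secondaryaccount.dfs.core.windows.net/silver/events"  # hypothetical

# Relative paths of data files referenced by 'add' actions in the transaction log
referenced = [
    r.path
    for r in spark.read.json(f"{table_path}/_delta_log/*.json")
                 .where(col("add").isNotNull())
                 .select(col("add.path").alias("path"))
                 .distinct()
                 .collect()
]

# Probe each referenced file on the secondary; dbutils.fs.ls raises if it is missing
missing = []
for rel in referenced:
    try:
        dbutils.fs.ls(f"{table_path}/{rel}")
    except Exception:
        missing.append(rel)

print(f"{len(missing)} of {len(referenced)} referenced data files are missing on the secondary")
```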
It also seems like RA-GZRS (read-access geo-zone-redundant storage) does not play well with Delta Lake, due to the same eventual-consistency issue…
And this is not the whole picture yet. Microsoft states that there is a Last Sync Time property that indicates the most recent time at which data from the primary region is guaranteed to have been written to the secondary. And rumor has it we should not trust that property either, as it is unreliable…
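For reference, the property can be read through the Blob SDK against the secondary endpoint (the account name and credential below are placeholders) – whether the value can be trusted is exactly the question:

```python
# Sketch: read geo-replication stats (Last Sync Time) from the secondary endpoint.
# Only works when read access to the secondary (RA-GRS/RA-GZRS) is enabled.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

account = "mylakestorage"  # hypothetical
client = BlobServiceClient(
    account_url=f"https://{account}-secondary.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)

geo = client.get_service_stats()["geo_replication"]
print(geo["status"], geo["last_sync_time"])  # e.g. 'live' and the last guaranteed sync time
```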
Databricks mentions several times in its documentation that Deep Clone should be used to copy Delta tables from the primary to the secondary region in a consistent way. Sounds good in theory; it does not, however, cover many of the streaming cases.
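Something along these lines (storage paths are placeholders) – the appeal is that the clone itself is transactionally consistent:

```python
# A minimal deep clone sketch; both paths are hypothetical.
spark.sql("""
    CREATE OR REPLACE TABLE delta.`abfss://lake@secondaryaccount.dfs.core.windows.net/silver/events`
    DEEP CLONE delta.`abfss://lake@primaryaccount.dfs.core.windows.net/silver/events`
""")
```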
Let’s take, for instance, stateless delta-to-delta streaming. Deep Clone does not preserve table history when data is copied (for all tables, not only streaming ones), so some additional work is required on the primary to map the processed streaming offsets onto the source table’s history, in order to restart on the secondary from a new (and correct) offset. Add to that stateful streams, delta-to-delta with multiple sources, and so on.
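To give an idea of the extra bookkeeping I mean on the primary (all paths are placeholders, and the single-Delta-source offset layout is my assumption, not anything documented): read the last committed offset from the streaming checkpoint and join it onto the source table’s history, so that the boundary can be stored alongside the clone and used to pick the restart point on the secondary:

```python
import json

checkpoint = "abfss://lake@primaryaccount.dfs.core.windows.net/checkpoints/events_stream"  # hypothetical
source_tbl = "abfss://lake@primaryaccount.dfs.core.windows.net/silver/events"              # hypothetical

# Last committed offset file written for a single Delta streaming source
offset_files = sorted(
    (f for f in dbutils.fs.ls(f"{checkpoint}/offsets") if not f.name.endswith(".tmp")),
    key=lambda f: int(f.name),
)
last_line = dbutils.fs.head(offset_files[-1].path).splitlines()[-1]
last_offset = json.loads(last_line)                 # assumed to contain reservoirVersion / index
processed_version = last_offset["reservoirVersion"]

# Map that source version onto the table's commit history (timestamp, operation, ...)
history = spark.sql(f"DESCRIBE HISTORY delta.`{source_tbl}`")
boundary = (history.where(f"version = {processed_version}")
                   .select("version", "timestamp")
                   .first())
print(boundary)  # persist this alongside the clone; restart the stream on the secondary from it
```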
Is there any straightforward way to minimize data loss when failing over from the primary to the secondary? Has anyone managed to implement a successful primary-to-secondary failover?