Administration & Architecture
Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.

Implement & Test DR Plan in AWS Databricks

APJESK
Contributor

Can you share detailed steps or a document for DR setup?

Example: consider Workspace A running in us-east-1, with plans to set up DR in us-west-2.

What are the steps and tasks I need to perform in AWS us-west-2?

Please share in detail the steps required at the AWS level and the Databricks level.

 

Thanks

2 REPLIES

nayan_wylde
Esteemed Contributor II
  1. Decide Your DR Strategy (Very Important First Step)

Before creating tasks, align on RPO/RTO:

| Component                | Recommended DR Strategy                                |
|--------------------------|--------------------------------------------------------|
| Workspace                | Cold or Warm Standby (separate workspace in DR region) |
| Metadata (Unity Catalog) | Multi-region metastore OR restore from backup          |
| Data (Delta/S3)          | Cross-region replication (CRR)                         |
| Notebooks / Jobs         | Git-based sync                                         |
| Secrets                  | Re-created or synced                                   |
| Compute                  | Recreated on demand                                    |

  2. AWS-Level Setup (us-west-2)

2.1 Networking (VPC)

Create a mirror VPC in us-west-2.

Tasks

  • Create VPC with same CIDR structure (or non-overlapping if peered)
  • Private subnets for:
    • Databricks compute
    • PrivateLink endpoints
  • NAT Gateway (for egress)
  • Route tables identical to primary

2.2 VPC Endpoints (Critical)

Create these Interface Endpoints in us-west-2:

  • com.amazonaws.us-west-2.s3
  • com.amazonaws.us-west-2.sts
  • com.amazonaws.us-west-2.kinesis (if used)
  • com.amazonaws.us-west-2.logs
  • com.amazonaws.us-west-2.monitoring

For Databricks:

  • Control plane endpoints provided during workspace creation

2.3 IAM (Cross-Region Consistency)

Create identical IAM roles as primary:

  a) Databricks Cross-Account Role
  • Trusts Databricks AWS account
  • Same policy permissions:
    • S3 access
    • EC2
    • IAM PassRole
    • Logs, KMS
  b) Instance Profile Role
  • Same permissions as primary
  • Attached to EC2 instances

2.4 S3 Data Layer (Most Important Part)

Option A: Centralized S3 (Recommended)

  • One S3 bucket (e.g., s3://org-datalake-prod)
  • Accessible from both regions
  • No replication needed
  • Lower complexity

Option B: Cross-Region Replication (CRR)

If data residency or latency requires regional buckets:

Tasks

  • Enable S3 CRR:
    • us-east-1 → us-west-2
  • Replicate:
    • Delta tables
    • _delta_log
  • Enable:
    • Versioning
    • Replication for delete markers
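S3 CRR is configured per source bucket as a replication rule. The sketch below shows that rule as the JSON structure accepted by `aws s3api put-bucket-replication` (or boto3's `put_bucket_replication`); the role ARN and bucket names are hypothetical placeholders, and delete-marker replication is enabled as the tasks above require.

```python
# Sketch of an S3 Cross-Region Replication rule for the DR bucket.
# All ARNs and bucket names below are hypothetical placeholders.
replication_config = {
    "Role": "arn:aws:iam::123456789012:role/s3-crr-role",  # hypothetical CRR role
    "Rules": [
        {
            "ID": "delta-lake-dr",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {"Prefix": ""},  # replicate the whole bucket, incl. _delta_log/
            "DeleteMarkerReplication": {"Status": "Enabled"},
            "Destination": {
                "Bucket": "arn:aws:s3:::org-datalake-dr-us-west-2",  # hypothetical DR bucket
                "StorageClass": "STANDARD",
            },
        }
    ],
}
```

Versioning must already be enabled on both buckets before this configuration is accepted.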

2.5 KMS (If Encryption Enabled)

  • Create KMS key in us-west-2
  • Update IAM policies to allow use
  • Use region-specific CMKs (standard KMS keys cannot be used cross-region)

  3. Databricks-Level Setup (us-west-2)

3.1 Create DR Databricks Workspace

Tasks

  • Create workspace in us-west-2
  • Attach to DR VPC
  • Use:
    • Same workspace name + -dr suffix
    • Same account console
  • Enable:
    • Unity Catalog
    • E2 networking
    • PrivateLink (if used)

3.2 Unity Catalog DR Strategy

Option 1: Single Metastore (Advanced)

  • One UC metastore
  • Assign both workspaces
  • Data accessible cross-region

Option 2: Separate Metastore (Most Common)

Tasks

  • Create new UC metastore in us-west-2
  • Assign DR workspace
  • Create:
    • Same catalogs
    • Same schemas
    • Same external locations

 Recommended for strict regional isolation

3.3 External Locations & Storage Credentials

Recreate exactly as primary:

  • Storage Credentials
  • External Locations
  • Grants

Use automation (Terraform / Databricks CLI)
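If you script this with the Unity Catalog REST API (`POST /api/2.1/unity-catalog/storage-credentials` and `POST /api/2.1/unity-catalog/external-locations`), the payloads are small. A sketch, with all names, ARNs, and bucket paths hypothetical:

```python
# Hypothetical payloads for recreating a storage credential and an
# external location in the DR workspace via the Unity Catalog REST API.
storage_credential = {
    "name": "dr_datalake_cred",
    "aws_iam_role": {"role_arn": "arn:aws:iam::123456789012:role/uc-dr-access"},
    "comment": "DR copy of the primary storage credential",
}

external_location = {
    "name": "dr_datalake_raw",
    # hypothetical DR-region bucket path
    "url": "s3://org-datalake-dr-us-west-2/raw",
    # must reference the credential created first
    "credential_name": storage_credential["name"],
    "comment": "Mirrors the primary datalake_raw external location",
}
```

Create the credential first, then the location, then replay the grants; the same ordering applies whether you use the API directly or Terraform.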

  4. Replicating Databricks Assets

4.1 Notebooks & Repos

Git is mandatory

Tasks

  • All notebooks stored in Git repos
  • Same repos configured in DR workspace
  • No manual notebooks in workspace

4.2 Jobs & Workflows

Recreate jobs using:

  • Databricks Terraform Provider (best)
  • Or Databricks Jobs API
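When recreating jobs from a Jobs API export, region-specific strings (instance-profile ARNs, S3 paths, endpoints) usually need rewriting before the settings are POSTed to the DR workspace. A minimal sketch of that rewrite step; the job fields and ARNs are hypothetical:

```python
import json

def adapt_job_for_dr(job_settings: dict, replacements: dict) -> dict:
    """Return a copy of exported job settings with region-specific
    substrings swapped for their DR-region equivalents."""
    text = json.dumps(job_settings)
    for old, new in replacements.items():
        text = text.replace(old, new)
    return json.loads(text)

# Hypothetical exported job settings, trimmed to the relevant fields.
primary_job = {
    "name": "daily_ingest",
    "tasks": [{
        "notebook_task": {"notebook_path": "/Repos/etl/ingest"},
        "new_cluster": {"aws_attributes": {
            "instance_profile_arn":
                "arn:aws:iam::123456789012:instance-profile/prod-east"}},
    }],
}

dr_job = adapt_job_for_dr(primary_job, {
    "instance-profile/prod-east": "instance-profile/prod-west",
})
```

The Terraform provider achieves the same end more robustly by keeping region-specific values in variables rather than rewriting exports.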

4.3 Secrets

Secrets are not replicated automatically

Options

  • Recreate manually in DR
  • Sync from:
    • AWS Secrets Manager
    • HashiCorp Vault

4.4 MLflow & Models

  • Store artifacts in:
    • Replicated S3 bucket
  • Recreate:
    • Registered models
    • Permissions

  5. DR Operations Runbook

5.1 Failover Procedure (Primary → DR)

Trigger Conditions

  • us-east-1 outage
  • Databricks control plane unavailable
  • Business decision

Steps

  1. Freeze ingestion in primary (if possible)
  2. Validate S3 replication status
  3. Enable jobs in DR workspace
  4. Update:
    • DNS / application endpoints
    • Airflow / ADF / external schedulers
  5. Validate:
    • Unity Catalog access
    • Delta table reads/writes
    • Downstream consumers

5.2 Failback (DR → Primary)

  1. Stop DR ingestion
  2. Allow replication to catch up
  3. Re-enable primary jobs
  4. Validate data consistency
  5. Disable DR jobs

SteveOstrowski
Databricks Employee

@APJESK

I have seen this pattern before. Disaster recovery planning for Databricks on AWS is a critical topic, and one that Databricks has solid documentation and tooling around. Let me walk you through a comprehensive approach to implementing and testing a DR plan.


UNDERSTANDING THE KEY CONCEPTS

Before diving in, it helps to clarify the distinction Databricks makes between High Availability (HA) and Disaster Recovery (DR):

- High Availability is handled within a single region. The Databricks control plane is already resilient to availability zone failures and can automatically recover within about 15 minutes. Compute clusters will restart in a different AZ if their current zone fails.

- Disaster Recovery addresses regional outages and requires explicit planning on your part. This is where your DR plan comes in.

You will also want to define your Recovery Point Objective (RPO) -- the maximum acceptable data loss window -- and your Recovery Time Objective (RTO) -- the maximum acceptable downtime. These will drive your architecture decisions.


CHOOSING A DR STRATEGY

There are two primary patterns:

1. Active-Passive (Recommended for most customers)
- You run production workloads in your primary region
- A secondary workspace in another AWS region is kept synchronized but idle
- During a regional outage, you failover to the secondary workspace
- This is simpler, cheaper, and has a straightforward failover/failback process

2. Active-Active (For maximum availability)
- Both regions run workloads simultaneously
- Jobs are only marked complete after successful execution in BOTH regions
- Requires strict CI/CD pipelines and is more expensive
- Best for organizations with near-zero RTO/RPO requirements


IMPLEMENTATION STEPS

Here is a phased approach:

PHASE 1: PLANNING
- Define your RPO and RTO requirements
- Map all integration points (data sources, downstream consumers, external tools)
- Identify a secondary AWS region that supports all required services (EC2, S3, etc.)
- Plan your communication strategy for failover events

PHASE 2: WORKSPACE REPLICATION
You need to replicate workspace objects to your secondary region. Here is what to sync and how:

Object Type            Recommended Approach
---------------------  ---------------------------------------
Notebook source code   CI/CD co-deployment to both regions
Users and Groups       Same IdP for both, or SCIM automation
Jobs                   Deploy to secondary with concurrency=0
Cluster configs        Templates in Git, co-deploy
Libraries              Source control and cluster templates
Secrets                Create in both workspaces via API
Access Controls        Co-deploy ACLs via API with ID mapping
Init scripts           Store in cloud storage, NOT DBFS root

Key tools for workspace replication:
- Databricks Terraform Provider: Infrastructure-as-code approach to deploy identical workspace configurations across regions. This is the most production-grade approach.
- Databricks Sync (DBSync): Open-source tool from Databricks Labs for backup, restore, and sync of workspace objects. Supports clusters, jobs, notebooks, instance pools, secrets, users, and groups.
GitHub: https://github.com/databrickslabs/databricks-sync
(Note: DBSync is provided for exploration and is not formally supported with SLAs.)
- Databricks REST APIs: For custom automation of object replication.

PHASE 3: DATA REPLICATION
Your data in S3 needs to be available in the secondary region:

- For Delta tables, use Delta Deep Clone for cross-region replication:

CREATE OR REPLACE TABLE dr_region.schema.my_table
CLONE primary_region.schema.my_table;

Deep clones can be run incrementally to sync only new changes.
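For more than a handful of tables, the clone statements are usually generated rather than hand-written. A small sketch that emits one statement per table, rerunnable for incremental syncs; catalog, schema, and table names are hypothetical:

```python
def clone_statements(tables, src_catalog="primary_region", dst_catalog="dr_region"):
    """Generate one CREATE OR REPLACE ... CLONE statement per (schema, table)
    pair; rerunning the same statement is incremental, copying only new files."""
    return [
        f"CREATE OR REPLACE TABLE {dst_catalog}.{schema}.{table} "
        f"CLONE {src_catalog}.{schema}.{table};"
        for schema, table in tables
    ]

# Hypothetical table list for the DR sync job.
stmts = clone_statements([("sales", "orders"), ("sales", "customers")])
```

Each statement would then be executed on a schedule (e.g. a DR sync job) via `spark.sql`.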

- For raw data in S3, use AWS S3 Cross-Region Replication (CRR) to keep buckets synchronized.

- IMPORTANT: Do NOT rely solely on S3's built-in redundancy for DR -- it protects within a region, not against a regional outage. Use explicit replication mechanisms.

- Do NOT store production data in the DBFS root bucket -- use external locations in S3 that you control and can replicate.

PHASE 4: STREAMING CONSIDERATIONS
If you use Structured Streaming, special attention is needed:
- Checkpoints contain location-specific metadata
- Store checkpoints in customer-managed S3 (not DBFS) so they can be replicated
- Consider running parallel streaming jobs in the secondary region
- Parameterize source/sink configurations so they can be swapped for DR endpoints

PHASE 5: PARAMETERIZE CONFIGURATIONS
Make your jobs and notebooks region-aware:
- Use configuration variables or Databricks secrets for storage paths, endpoints, and connection strings
- During failover, update these parameters to point to secondary region resources
- This avoids hardcoded region-specific values that would require code changes during DR
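One common way to implement the parameterization above is a single site switch that every notebook and job reads; flipping it is then the only code-level change during failover. A minimal sketch, with all buckets and paths hypothetical:

```python
# Hypothetical region-aware configuration: one switch, no hardcoded paths.
REGION_CONFIG = {
    "primary": {
        "region": "us-east-1",
        "data_root": "s3://org-datalake-prod/",
        "checkpoint_root": "s3://org-checkpoints-prod/",
    },
    "dr": {
        "region": "us-west-2",
        "data_root": "s3://org-datalake-dr/",
        "checkpoint_root": "s3://org-checkpoints-dr/",
    },
}

def resolve(active_site: str, key: str) -> str:
    """Look up a region-specific value; set active_site to 'dr' at failover."""
    return REGION_CONFIG[active_site][key]
```

In practice the active site would come from a job parameter, a secret, or a small control table rather than a literal, so no code redeploy is needed.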


TESTING YOUR DR PLAN

This is the most important part -- a DR plan that has not been tested is not a real DR plan.

FAILOVER TEST PROCEDURE:
1. Gracefully shut down primary region workloads and let all running jobs complete
2. Verify the secondary region is unaffected by the simulated outage
3. Sync the latest data and workspace state from primary to secondary
4. Disable primary region pools and clusters to prevent accidental processing
5. Activate the secondary region: start pools/clusters, set job concurrency back to normal
6. Update external tool configurations (URLs, JDBC/ODBC connections, API endpoints)
7. Validate end-to-end data flow in the secondary region
8. Notify users of the new workspace URL

FAILBACK TEST PROCEDURE:
1. Confirm the primary region is restored and healthy
2. Disable secondary region pools and clusters
3. Sync any data and workspace changes made in the secondary region back to primary
4. Update all connections to point back to the primary region
5. Resume normal operations in primary
6. Re-establish the secondary region sync for future DR readiness

TESTING TIPS:
- Schedule DR tests at least twice per year
- Start with tabletop exercises (walk through the plan without executing)
- Progress to partial failovers (test individual components)
- Eventually run full failover/failback drills
- Document every issue encountered and update your runbook
- Measure actual RTO and RPO during tests and compare to your targets
- Test that monitoring and alerting works in the secondary region


IMPORTANT LIMITATIONS TO KNOW

- DR does NOT protect against data corruption. Corrupted data will replicate to your secondary region. Use Delta time travel (table history) for data corruption recovery instead.
- Object IDs differ between workspaces, so you need to maintain ID mappings for ACLs and cross-references.
- Mount points may need different storage endpoints in the secondary region.
- Unity Catalog considerations: Each region has its own metastore. Managed tables cannot be registered across multiple metastores. For cross-region data access, use Delta Sharing or external tables. Access controls and lineage graphs are per-metastore and do not cross region boundaries, so these must be replicated separately.
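The object-ID mapping mentioned above is often handled with a small name-keyed lookup built from object listings in both workspaces, so ACL exports can be rewritten before being applied in DR. A sketch with simplified, hypothetical fields:

```python
def build_id_map(primary_objects, dr_objects):
    """Map primary-workspace object IDs to DR-workspace IDs by matching on
    the stable name. Inputs are lists of {'id', 'name'} dicts (field names
    simplified for illustration; real API listings carry more fields)."""
    dr_by_name = {o["name"]: o["id"] for o in dr_objects}
    return {o["id"]: dr_by_name[o["name"]]
            for o in primary_objects if o["name"] in dr_by_name}

# Hypothetical IDs: group 101 in primary corresponds to group 9001 in DR.
id_map = build_id_map(
    [{"id": 101, "name": "analysts"}, {"id": 102, "name": "admins"}],
    [{"id": 9001, "name": "analysts"}],
)
```

Objects present only in primary (here, group 102) drop out of the map, which is a useful signal that the DR workspace is missing something.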


UNITY CATALOG DR CONSIDERATIONS

If you use Unity Catalog, keep in mind:
- Each region has its own metastore, so you will need a metastore in us-west-2 as well
- Managed tables are bound to a single metastore, so use external tables if you need DR flexibility
- Use Databricks-to-Databricks Delta Sharing for cross-region data access
- Access controls (grants) and lineage are scoped to the metastore level and must be recreated
- Egress charges apply for cross-region data movement, so plan replication costs accordingly


DOCUMENTATION REFERENCES

- Disaster Recovery overview and architecture:
https://docs.databricks.com/aws/en/admin/disaster-recovery

- Delta Deep Clone for cross-region replication:
https://docs.databricks.com/en/delta/clone.html

- Databricks Terraform Provider:
https://registry.terraform.io/providers/databricks/databricks/latest/docs

- Databricks Sync (DBSync) tool:
https://github.com/databrickslabs/databricks-sync

- Databricks REST API reference (for custom automation):
https://docs.databricks.com/api/workspace/introduction

- Unity Catalog best practices (includes DR guidance):
https://docs.databricks.com/en/data-governance/unity-catalog/best-practices.html

- AWS S3 Cross-Region Replication:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html


I hope this gives you a solid framework to work from. If you share more details about your specific setup (whether you use Unity Catalog, streaming workloads, how many jobs you run, etc.), the community can help you refine the plan further.

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.
