Administration & Architecture
Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.

Implement & Test DR Plan in AWS Databricks

APJESK
Contributor

Can you share detailed steps or a document for DR setup?

Example: consider Workspace A running in us-east-1, with plans to set up DR in us-west-2.

What are the steps and tasks I need to perform in AWS us-west-2?

Please share in detail the steps required at the AWS level and the Databricks level.

 

Thanks

2 REPLIES

nayan_wylde
Esteemed Contributor II
  1. Decide Your DR Strategy (Very Important First Step)

Before creating tasks, align on RPO/RTO:

| Component                | Recommended DR Strategy                                |
|--------------------------|--------------------------------------------------------|
| Workspace                | Cold or Warm Standby (separate workspace in DR region) |
| Metadata (Unity Catalog) | Multi-region metastore OR restore from backup          |
| Data (Delta/S3)          | Cross-region replication (CRR)                         |
| Notebooks / Jobs         | Git-based sync                                         |
| Secrets                  | Re-created or synced                                   |
| Compute                  | Recreated on demand                                    |

  2. AWS-Level Setup (us-west-2)

2.1 Networking (VPC)

Create a mirror VPC in us-west-2.

Tasks

  • Create VPC with same CIDR structure (or non-overlapping if peered)
  • Private subnets for:
    • Databricks compute
    • PrivateLink endpoints
  • NAT Gateway (for egress)
  • Route tables identical to primary

2.2 VPC Endpoints (Critical)

Create these Interface Endpoints in us-west-2:

  • com.amazonaws.us-west-2.s3
  • com.amazonaws.us-west-2.sts
  • com.amazonaws.us-west-2.kinesis (if used)
  • com.amazonaws.us-west-2.logs
  • com.amazonaws.us-west-2.monitoring

For Databricks:

  • Control plane endpoints provided during workspace creation

2.3 IAM (Cross-Region Consistency)

Create identical IAM roles as primary:

  a) Databricks Cross-Account Role
  • Trusts Databricks AWS account
  • Same policy permissions:
    • S3 access
    • EC2
    • IAM PassRole
    • Logs, KMS
  b) Instance Profile Role
  • Same permissions as primary
  • Attached to EC2 instances

2.4 S3 Data Layer (Most Important Part)

Option A: Centralized S3 (Recommended)

  • One S3 bucket (e.g., s3://org-datalake-prod)
  • Accessible from both regions
  • No replication needed
  • Lower complexity

Option B: Cross-Region Replication (CRR)

If data residency or latency requires regional buckets:

Tasks

  • Enable S3 CRR:
    • us-east-1 → us-west-2
  • Replicate:
    • Delta tables
    • _delta_log
  • Enable:
    • Versioning
    • Replication for delete markers
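S3 CRR is configured per source bucket as a replication rule. The sketch below shows that rule as the JSON structure accepted by `aws s3api put-bucket-replication` (or boto3's `put_bucket_replication`); the role ARN and bucket names are hypothetical placeholders, and delete-marker replication is enabled as the tasks above require.

```python
# Sketch of an S3 Cross-Region Replication rule for the DR bucket.
# All ARNs and bucket names below are hypothetical placeholders.
replication_config = {
    "Role": "arn:aws:iam::123456789012:role/s3-crr-role",  # hypothetical CRR role
    "Rules": [
        {
            "ID": "delta-lake-dr",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {"Prefix": ""},  # replicate the whole bucket, incl. _delta_log/
            "DeleteMarkerReplication": {"Status": "Enabled"},
            "Destination": {
                "Bucket": "arn:aws:s3:::org-datalake-dr-us-west-2",  # hypothetical DR bucket
                "StorageClass": "STANDARD",
            },
        }
    ],
}
```

Versioning must already be enabled on both buckets before this configuration is accepted.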

2.5 KMS (If Encryption Enabled)

  • Create KMS key in us-west-2
  • Update IAM policies to allow use
  • Use region-specific CMKs (standard KMS keys cannot be used cross-region)

  3. Databricks-Level Setup (us-west-2)

3.1 Create DR Databricks Workspace

Tasks

  • Create workspace in us-west-2
  • Attach to DR VPC
  • Use:
    • Same workspace name + -dr suffix
    • Same account console
  • Enable:
    • Unity Catalog
    • E2 networking
    • PrivateLink (if used)

3.2 Unity Catalog DR Strategy

Option 1: Single Metastore (Advanced)

  • One UC metastore
  • Assign both workspaces
  • Data accessible cross-region

Option 2: Separate Metastore (Most Common)

Tasks

  • Create new UC metastore in us-west-2
  • Assign DR workspace
  • Create:
    • Same catalogs
    • Same schemas
    • Same external locations

 Recommended for strict regional isolation

3.3 External Locations & Storage Credentials

Recreate exactly as primary:

  • Storage Credentials
  • External Locations
  • Grants

Use automation (Terraform / Databricks CLI)
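If you script this with the Unity Catalog REST API (`POST /api/2.1/unity-catalog/storage-credentials` and `POST /api/2.1/unity-catalog/external-locations`), the payloads are small. A sketch, with all names, ARNs, and bucket paths hypothetical:

```python
# Hypothetical payloads for recreating a storage credential and an
# external location in the DR workspace via the Unity Catalog REST API.
storage_credential = {
    "name": "dr_datalake_cred",
    "aws_iam_role": {"role_arn": "arn:aws:iam::123456789012:role/uc-dr-access"},
    "comment": "DR copy of the primary storage credential",
}

external_location = {
    "name": "dr_datalake_raw",
    # hypothetical DR-region bucket path
    "url": "s3://org-datalake-dr-us-west-2/raw",
    # must reference the credential created first
    "credential_name": storage_credential["name"],
    "comment": "Mirrors the primary datalake_raw external location",
}
```

Create the credential first, then the location, then replay the grants; the same ordering applies whether you use the API directly or Terraform.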

  4. Replicating Databricks Assets

4.1 Notebooks & Repos

Git is mandatory

Tasks

  • All notebooks stored in Git repos
  • Same repos configured in DR workspace
  • No manual notebooks in workspace

4.2 Jobs & Workflows

Recreate jobs using:

  • Databricks Terraform Provider (best)
  • Or Databricks Jobs API
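When recreating jobs from a Jobs API export, region-specific strings (instance-profile ARNs, S3 paths, endpoints) usually need rewriting before the settings are POSTed to the DR workspace. A minimal sketch of that rewrite step; the job fields and ARNs are hypothetical:

```python
import json

def adapt_job_for_dr(job_settings: dict, replacements: dict) -> dict:
    """Return a copy of exported job settings with region-specific
    substrings swapped for their DR-region equivalents."""
    text = json.dumps(job_settings)
    for old, new in replacements.items():
        text = text.replace(old, new)
    return json.loads(text)

# Hypothetical exported job settings, trimmed to the relevant fields.
primary_job = {
    "name": "daily_ingest",
    "tasks": [{
        "notebook_task": {"notebook_path": "/Repos/etl/ingest"},
        "new_cluster": {"aws_attributes": {
            "instance_profile_arn":
                "arn:aws:iam::123456789012:instance-profile/prod-east"}},
    }],
}

dr_job = adapt_job_for_dr(primary_job, {
    "instance-profile/prod-east": "instance-profile/prod-west",
})
```

The Terraform provider achieves the same end more robustly by keeping region-specific values in variables rather than rewriting exports.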

4.3 Secrets

Secrets are not replicated automatically

Options

  • Recreate manually in DR
  • Sync from:
    • AWS Secrets Manager
    • HashiCorp Vault

4.4 MLflow & Models

  • Store artifacts in:
    • Replicated S3 bucket
  • Recreate:
    • Registered models
    • Permissions

  5. DR Operations Runbook

5.1 Failover Procedure (Primary → DR)

Trigger Conditions

  • us-east-1 outage
  • Databricks control plane unavailable
  • Business decision

Steps

  1. Freeze ingestion in primary (if possible)
  2. Validate S3 replication status
  3. Enable jobs in DR workspace
  4. Update:
    • DNS / application endpoints
    • Airflow / ADF / external schedulers
  5. Validate:
    • Unity Catalog access
    • Delta table reads/writes
    • Downstream consumers

5.2 Failback (DR → Primary)

  1. Stop DR ingestion
  2. Allow replication to catch up
  3. Re-enable primary jobs
  4. Validate data consistency
  5. Disable DR jobs

SteveOstrowski
Databricks Employee

@APJESK

I have seen this pattern before. Disaster recovery planning for Databricks on AWS is a critical topic, and one that Databricks has solid documentation and tooling around. Let me walk you through a comprehensive approach to implementing and testing a DR plan.


UNDERSTANDING THE KEY CONCEPTS

Before diving in, it helps to clarify the distinction Databricks makes between High Availability (HA) and Disaster Recovery (DR):

- High Availability is handled within a single region. The Databricks control plane is already resilient to availability zone failures and can automatically recover within about 15 minutes. Compute clusters will restart in a different AZ if their current zone fails.

- Disaster Recovery addresses regional outages and requires explicit planning on your part. This is where your DR plan comes in.

You will also want to define your Recovery Point Objective (RPO) -- the maximum acceptable data loss window -- and your Recovery Time Objective (RTO) -- the maximum acceptable downtime. These will drive your architecture decisions.


CHOOSING A DR STRATEGY

There are two primary patterns:

1. Active-Passive (Recommended for most customers)
- You run production workloads in your primary region
- A secondary workspace in another AWS region is kept synchronized but idle
- During a regional outage, you failover to the secondary workspace
- This is simpler, cheaper, and has a straightforward failover/failback process

2. Active-Active (For maximum availability)
- Both regions run workloads simultaneously
- Jobs are only marked complete after successful execution in BOTH regions
- Requires strict CI/CD pipelines and is more expensive
- Best for organizations with near-zero RTO/RPO requirements


IMPLEMENTATION STEPS

Here is a phased approach:

PHASE 1: PLANNING
- Define your RPO and RTO requirements
- Map all integration points (data sources, downstream consumers, external tools)
- Identify a secondary AWS region that supports all required services (EC2, S3, etc.)
- Plan your communication strategy for failover events

PHASE 2: WORKSPACE REPLICATION
You need to replicate workspace objects to your secondary region. Here is what to sync and how:

Object Type            Recommended Approach
---------------------  ---------------------------------------
Notebook source code   CI/CD co-deployment to both regions
Users and Groups       Same IdP for both, or SCIM automation
Jobs                   Deploy to secondary with concurrency=0
Cluster configs        Templates in Git, co-deploy
Libraries              Source control and cluster templates
Secrets                Create in both workspaces via API
Access Controls        Co-deploy ACLs via API with ID mapping
Init scripts           Store in cloud storage, NOT DBFS root

Key tools for workspace replication:
- Databricks Terraform Provider: Infrastructure-as-code approach to deploy identical workspace configurations across regions. This is the most production-grade approach.
- Databricks Sync (DBSync): Open-source tool from Databricks Labs for backup, restore, and sync of workspace objects. Supports clusters, jobs, notebooks, instance pools, secrets, users, and groups.
GitHub: https://github.com/databrickslabs/databricks-sync
(Note: DBSync is provided for exploration and is not formally supported with SLAs.)
- Databricks REST APIs: For custom automation of object replication.

PHASE 3: DATA REPLICATION
Your data in S3 needs to be available in the secondary region:

- For Delta tables, use Delta Deep Clone for cross-region replication:

CREATE OR REPLACE TABLE dr_region.schema.my_table
CLONE primary_region.schema.my_table;

Deep clones can be run incrementally to sync only new changes.
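For more than a handful of tables, the clone statements are usually generated rather than hand-written. A small sketch that emits one statement per table, rerunnable for incremental syncs; catalog, schema, and table names are hypothetical:

```python
def clone_statements(tables, src_catalog="primary_region", dst_catalog="dr_region"):
    """Generate one CREATE OR REPLACE ... CLONE statement per (schema, table)
    pair; rerunning the same statement is incremental, copying only new files."""
    return [
        f"CREATE OR REPLACE TABLE {dst_catalog}.{schema}.{table} "
        f"CLONE {src_catalog}.{schema}.{table};"
        for schema, table in tables
    ]

# Hypothetical table list for the DR sync job.
stmts = clone_statements([("sales", "orders"), ("sales", "customers")])
```

Each statement would then be executed on a schedule (e.g. a DR sync job) via `spark.sql`.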

- For raw data in S3, use AWS S3 Cross-Region Replication (CRR) to keep buckets synchronized.

- IMPORTANT: Do NOT rely solely on S3's built-in redundancy for DR -- it protects within a region, not against a regional outage. Use explicit replication mechanisms.

- Do NOT store production data in the DBFS root bucket -- use external locations in S3 that you control and can replicate.

PHASE 4: STREAMING CONSIDERATIONS
If you use Structured Streaming, special attention is needed:
- Checkpoints contain location-specific metadata
- Store checkpoints in customer-managed S3 (not DBFS) so they can be replicated
- Consider running parallel streaming jobs in the secondary region
- Parameterize source/sink configurations so they can be swapped for DR endpoints

PHASE 5: PARAMETERIZE CONFIGURATIONS
Make your jobs and notebooks region-aware:
- Use configuration variables or Databricks secrets for storage paths, endpoints, and connection strings
- During failover, update these parameters to point to secondary region resources
- This avoids hardcoded region-specific values that would require code changes during DR
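One common way to implement the parameterization above is a single site switch that every notebook and job reads; flipping it is then the only code-level change during failover. A minimal sketch, with all buckets and paths hypothetical:

```python
# Hypothetical region-aware configuration: one switch, no hardcoded paths.
REGION_CONFIG = {
    "primary": {
        "region": "us-east-1",
        "data_root": "s3://org-datalake-prod/",
        "checkpoint_root": "s3://org-checkpoints-prod/",
    },
    "dr": {
        "region": "us-west-2",
        "data_root": "s3://org-datalake-dr/",
        "checkpoint_root": "s3://org-checkpoints-dr/",
    },
}

def resolve(active_site: str, key: str) -> str:
    """Look up a region-specific value; set active_site to 'dr' at failover."""
    return REGION_CONFIG[active_site][key]
```

In practice the active site would come from a job parameter, a secret, or a small control table rather than a literal, so no code redeploy is needed.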


TESTING YOUR DR PLAN

This is the most important part -- a DR plan that has not been tested is not a real DR plan.

FAILOVER TEST PROCEDURE:
1. Gracefully shut down primary region workloads and let all running jobs complete
2. Verify the secondary region is unaffected by the simulated outage
3. Sync the latest data and workspace state from primary to secondary
4. Disable primary region pools and clusters to prevent accidental processing
5. Activate the secondary region: start pools/clusters, set job concurrency back to normal
6. Update external tool configurations (URLs, JDBC/ODBC connections, API endpoints)
7. Validate end-to-end data flow in the secondary region
8. Notify users of the new workspace URL

FAILBACK TEST PROCEDURE:
1. Confirm the primary region is restored and healthy
2. Disable secondary region pools and clusters
3. Sync any data and workspace changes made in the secondary region back to primary
4. Update all connections to point back to the primary region
5. Resume normal operations in primary
6. Re-establish the secondary region sync for future DR readiness

TESTING TIPS:
- Schedule DR tests at least twice per year
- Start with tabletop exercises (walk through the plan without executing)
- Progress to partial failovers (test individual components)
- Eventually run full failover/failback drills
- Document every issue encountered and update your runbook
- Measure actual RTO and RPO during tests and compare to your targets
- Test that monitoring and alerting works in the secondary region


IMPORTANT LIMITATIONS TO KNOW

- DR does NOT protect against data corruption. Corrupted data will replicate to your secondary region. Use Delta time travel (table history) for data corruption recovery instead.
- Object IDs differ between workspaces, so you need to maintain ID mappings for ACLs and cross-references.
- Mount points may need different storage endpoints in the secondary region.
- Unity Catalog considerations: Each region has its own metastore. Managed tables cannot be registered across multiple metastores. For cross-region data access, use Delta Sharing or external tables. Access controls and lineage graphs are per-metastore and do not cross region boundaries, so these must be replicated separately.
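The object-ID mapping mentioned above is often handled with a small name-keyed lookup built from object listings in both workspaces, so ACL exports can be rewritten before being applied in DR. A sketch with simplified, hypothetical fields:

```python
def build_id_map(primary_objects, dr_objects):
    """Map primary-workspace object IDs to DR-workspace IDs by matching on
    the stable name. Inputs are lists of {'id', 'name'} dicts (field names
    simplified for illustration; real API listings carry more fields)."""
    dr_by_name = {o["name"]: o["id"] for o in dr_objects}
    return {o["id"]: dr_by_name[o["name"]]
            for o in primary_objects if o["name"] in dr_by_name}

# Hypothetical IDs: group 101 in primary corresponds to group 9001 in DR.
id_map = build_id_map(
    [{"id": 101, "name": "analysts"}, {"id": 102, "name": "admins"}],
    [{"id": 9001, "name": "analysts"}],
)
```

Objects present only in primary (here, group 102) drop out of the map, which is a useful signal that the DR workspace is missing something.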


UNITY CATALOG DR CONSIDERATIONS

If you use Unity Catalog, keep in mind:
- Each region has its own metastore, so you will need a metastore in us-west-2 as well
- Managed tables are bound to a single metastore, so use external tables if you need DR flexibility
- Use Databricks-to-Databricks Delta Sharing for cross-region data access
- Access controls (grants) and lineage are scoped to the metastore level and must be recreated
- Egress charges apply for cross-region data movement, so plan replication costs accordingly


DOCUMENTATION REFERENCES

- Disaster Recovery overview and architecture:
https://docs.databricks.com/aws/en/admin/disaster-recovery

- Delta Deep Clone for cross-region replication:
https://docs.databricks.com/en/delta/clone.html

- Databricks Terraform Provider:
https://registry.terraform.io/providers/databricks/databricks/latest/docs

- Databricks Sync (DBSync) tool:
https://github.com/databrickslabs/databricks-sync

- Databricks REST API reference (for custom automation):
https://docs.databricks.com/api/workspace/introduction

- Unity Catalog best practices (includes DR guidance):
https://docs.databricks.com/en/data-governance/unity-catalog/best-practices.html

- AWS S3 Cross-Region Replication:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html


I hope this gives you a solid framework to work from. If you share more details about your specific setup (whether you use Unity Catalog, streaming workloads, how many jobs you run, etc.), the community can help you refine the plan further.

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.
