SteveOstrowski
Databricks Employee

@databricks_use2 -- Happy to help with this one. Moving from commercial to GovCloud with a full SDP medallion architecture requires careful planning. Here is a step-by-step approach that covers both the data migration and the pipeline rebuild.


THE SHORT ANSWER

You will NOT be able to simply copy your SDP checkpoints from commercial to GovCloud and resume. Checkpoints contain cloud-specific paths, storage URIs, file identifiers, and streaming offsets that are tied to the source environment. The recommended approach is: migrate the data, redeploy the pipeline definitions, and start fresh checkpoints at cutover.


STEP 1: MIGRATE YOUR DATA WITH DEEP CLONE

Use Delta DEEP CLONE to copy all of your tables (bronze, silver, and gold) to the new GovCloud storage location. Deep clone copies both the data files and the Delta transaction log metadata.

For each table, run something like:

CREATE TABLE gov_catalog.schema.bronze_table
DEEP CLONE commercial_catalog.schema.bronze_table;

Key details about deep clone:
- It copies schema, partitioning, invariants, nullability, and streaming metadata
- For SCD2 silver tables, deep clone will preserve the full history of rows (all your _start_date, _end_date, _is_current columns will be intact)
- Change Data Feed history is NOT preserved in clones -- the cloned table starts a fresh CDF history
- Table descriptions and user-defined commit metadata are not cloned
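If you have many tables, you can script the clone statements rather than writing each one by hand. A minimal sketch (the table, catalog, and schema names are placeholders; in a real run you would execute each statement with spark.sql):

```python
# Sketch: generate DEEP CLONE statements for every table in a medallion schema.
# Catalog/schema/table names below are placeholders -- substitute your own.
TABLES = ["bronze_orders", "silver_orders_scd2", "gold_orders_daily"]

def clone_statement(table: str,
                    src: str = "commercial_catalog.schema",
                    dst: str = "gov_catalog.schema") -> str:
    """Build a CREATE OR REPLACE TABLE ... DEEP CLONE statement for one table."""
    return (
        f"CREATE OR REPLACE TABLE {dst}.{table} "
        f"DEEP CLONE {src}.{table}"
    )

statements = [clone_statement(t) for t in TABLES]
# In a Databricks notebook you would run: for s in statements: spark.sql(s)
```

Using CREATE OR REPLACE matters here: re-running a deep clone against an existing clone is incremental (only files changed since the last clone are copied), which is what makes the final catch-up clone at cutover cheap.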

Cross-cloud deep clone requires network connectivity between the two environments, so the right approach depends on your setup:

Option A -- If you can mount or access both storage accounts from a single workspace:
Run the DEEP CLONE SQL directly.

Option B -- If the environments are fully isolated:
1. Export the data using a cloud-native tool (e.g., azcopy, aws s3 sync, or similar) to move the underlying Parquet/Delta files to the GovCloud storage
2. Register the tables in the GovCloud metastore using CREATE TABLE ... LOCATION

Documentation: https://docs.databricks.com/en/delta/clone.html
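For the isolated-environment path (Option B), the registration step can also be scripted. A sketch, assuming the copied Delta directories landed under a storage root you control (the root path below is a placeholder):

```python
# Sketch: generate CREATE TABLE ... LOCATION statements to register copied
# Delta table directories in the GovCloud metastore. Registration only points
# the metastore at existing files -- no data is rewritten.
STORAGE_ROOT = "s3://govcloud-bucket/delta"  # placeholder storage root

def register_statement(table: str,
                       catalog_schema: str = "gov_catalog.schema",
                       root: str = STORAGE_ROOT) -> str:
    """Register an existing Delta directory as a table in the metastore."""
    return (
        f"CREATE TABLE IF NOT EXISTS {catalog_schema}.{table} "
        f"USING DELTA LOCATION '{root}/{table}'"
    )

stmt = register_statement("bronze_orders")
# In a notebook: spark.sql(stmt)
```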


STEP 2: REDEPLOY YOUR PIPELINE DEFINITIONS

Your SDP pipeline code (the SQL or Python notebook definitions) should be redeployed to the new GovCloud workspace. A few approaches:

1. Databricks Asset Bundles (recommended): If you are using infrastructure-as-code, define your pipeline in a bundle YAML and deploy to the GovCloud workspace. This is the cleanest approach for CI/CD across environments.
Documentation: https://docs.databricks.com/en/dev-tools/bundles/index.html

2. Databricks Terraform Provider: Export your pipeline configuration and redeploy via Terraform to the new workspace.

3. Databricks Labs Migrate Tool: An open-source tool that can migrate workspace objects (notebooks, jobs, clusters, users, secrets) between workspaces, including cross-cloud. It does NOT migrate DBFS data directly, but handles the workspace configuration layer.
GitHub: https://github.com/databrickslabs/migrate

4. Manual: Export your notebooks and import them into the new workspace, then recreate the pipeline configuration through the UI or API.
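As an illustration of the Asset Bundles approach, a minimal bundle definition might look like the sketch below. All names, paths, and the GovCloud host are placeholders, and you should check the bundles documentation for the full resource schema:

```yaml
# Sketch of a databricks.yml deploying the SDP pipeline to a GovCloud target.
# Every name, path, and host below is a placeholder.
bundle:
  name: medallion-pipeline

resources:
  pipelines:
    medallion_sdp:
      name: medallion-sdp
      catalog: gov_catalog
      target: schema
      libraries:
        - notebook:
            path: ./src/pipeline_definitions

targets:
  govcloud:
    workspace:
      host: https://your-govcloud-workspace-host  # placeholder
```

With this in place, `databricks bundle deploy -t govcloud` pushes the same pipeline definition you run in commercial, which keeps the two environments from drifting apart.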


STEP 3: HANDLE THE CUTOVER FOR AUTOLOADER AND STREAMING TABLES

This is the critical part. Since checkpoints cannot be reused, you need a clean cutover strategy:

1. STOP the pipeline in commercial at a known point in time (note the timestamp).

2. Make sure your deep clone of bronze tables is complete and up to date as of that timestamp.

3. In GovCloud, start the SDP pipeline with FRESH checkpoints. With fresh checkpoints, the pipeline will try to reprocess all data from the beginning, so you have two options to avoid reprocessing everything:

Option A -- Full Refresh from Cloned Bronze:
- Point your silver/gold SDP definitions at the cloned bronze tables
- Do a full refresh -- this rebuilds silver and gold from bronze
- Since you already deep-cloned bronze, this reruns only the transformations rather than re-ingesting from the source
- For SCD2 tables, a full refresh will rebuild the history from bronze

Option B -- Use Cloned Silver/Gold and Only Process New Data:
- Deep clone bronze, silver, AND gold tables
- When starting the new SDP pipeline, configure your Autoloader source to only pick up NEW files arriving after the cutover timestamp
- You can use the Autoloader option modifiedAfter to filter:

.option("modifiedAfter", "2026-03-07T00:00:00.000Z")

- This avoids reprocessing historical data entirely
- For SCD2 silver tables, new changes will be applied on top of the cloned history

4. Once the GovCloud pipeline is running and processing new data successfully, decommission the commercial pipeline.
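For Option B, the cutover filter is a single extra Auto Loader option. A sketch that builds the reader options (the cutover timestamp, source format, and path are placeholders; modifiedAfter is a documented Auto Loader option):

```python
# Sketch: build the Auto Loader reader options for a post-cutover start.
# The cutover timestamp and source format below are placeholders.
from datetime import datetime, timezone

CUTOVER = datetime(2026, 3, 7, tzinfo=timezone.utc)  # placeholder cutover time

def autoloader_options(fmt: str, cutover: datetime) -> dict:
    """Options for spark.readStream.format('cloudFiles').options(**...)."""
    return {
        "cloudFiles.format": fmt,
        # Only discover files modified after the cutover; historical data is
        # already present via the deep-cloned bronze/silver/gold tables.
        "modifiedAfter": cutover.strftime("%Y-%m-%dT%H:%M:%S.000Z"),
    }

opts = autoloader_options("json", CUTOVER)
# In the pipeline: spark.readStream.format("cloudFiles").options(**opts).load(path)
```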


SPECIAL CONSIDERATIONS FOR SCD2 TABLES

Since your silver tables use SCD Type 2 (APPLY CHANGES with SCD Type 2 in SDP), keep in mind:

- Deep clone preserves all the historical rows, so your SCD2 history is safe
- If you do a full refresh of the SDP pipeline in GovCloud, it will rebuild SCD2 history from bronze -- this should produce identical results if your bronze data is complete
- If you use Option B (clone silver and process only new data), your existing SCD2 rows are preserved and new changes are applied going forward
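Whichever option you choose, validate the SCD2 history after migration. A sketch of the kind of summary query to run on both sides and diff (column names follow the _start_date/_end_date/_is_current convention above; the table name is a placeholder):

```python
# Sketch: build a validation query that summarizes SCD2 history for a table.
# Run it against the commercial and GovCloud copies and compare the results.
def scd2_summary_sql(table: str) -> str:
    return f"""
        SELECT COUNT(*)                                     AS total_rows,
               SUM(CASE WHEN _is_current THEN 1 ELSE 0 END) AS current_rows,
               MIN(_start_date)                             AS earliest_start,
               MAX(_end_date)                               AS latest_end
        FROM {table}
    """.strip()

sql = scd2_summary_sql("gov_catalog.schema.silver_orders_scd2")
# In a notebook: spark.sql(sql).show(), then compare with the commercial side
```

Matching totals, current-row counts, and date ranges on both sides is a quick sanity check that the clone (or rebuild) preserved the full history.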


ADDITIONAL TOOLS AND CONSIDERATIONS

- Delta Sharing: If you need ongoing data synchronization between commercial and GovCloud (not just a one-time migration), Delta Sharing supports cross-cloud sharing between Databricks accounts. Be aware of data egress costs.
Documentation: https://docs.databricks.com/en/delta-sharing/index.html

- Disaster Recovery Guide: The Databricks DR documentation covers active-passive patterns that are relevant to cross-region/cross-cloud scenarios and recommends Deep Clone as the primary data replication mechanism.
Documentation: https://docs.databricks.com/en/admin/disaster-recovery.html

- Auto Loader Checkpoint Details: Checkpoints store file discovery state in RocksDB. They track which files have been processed to guarantee exactly-once semantics. This state is environment-specific and not portable across clouds.
Documentation: https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/production.html

- SDP Full Refresh: You can selectively refresh specific tables or reset checkpoints for specific flows using the REST API (reset_checkpoint_selection parameter) without clearing all data.
Documentation: https://docs.databricks.com/en/delta-live-tables/updates.html
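As an illustration, a selective full refresh is triggered by starting a pipeline update with a table selection via the REST API. A sketch that only builds the request payload (the field name should be verified against the current Pipelines API docs; pipeline ID and host are placeholders and nothing is sent here):

```python
# Sketch: payload for POST /api/2.0/pipelines/{pipeline_id}/updates that
# full-refreshes only the selected tables while others continue incrementally.
# Verify field names against the current Pipelines API docs before relying
# on this shape.
import json

def selective_refresh_payload(tables: list[str]) -> str:
    return json.dumps({
        "full_refresh_selection": tables,
    })

payload = selective_refresh_payload(["silver_orders_scd2"])
# Send with the Databricks SDK or any HTTP client, authenticated to the
# GovCloud workspace.
```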


RECOMMENDED MIGRATION SEQUENCE SUMMARY

1. Deploy pipeline code to GovCloud workspace (Asset Bundles, Terraform, or manual)
2. Deep clone all bronze, silver, and gold tables to GovCloud storage
3. Stop the commercial pipeline at a known cutover time
4. Final incremental deep clone to catch any last changes
5. Start the GovCloud pipeline with Autoloader filtering for files after cutover
6. Validate data in GovCloud matches expectations
7. Decommission commercial pipeline

Hope this helps -- this is a common pattern for cross-cloud migrations and the key insight is to treat the data and the pipeline definitions as two separate migration workstreams.

* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.
