Edmonton

Multi-cloud implementation

wilson-mok
New Contributor III

Hi Liliana,

During the user group discussion, there was a mention of multi-cloud implementation with Databricks: if a workload fails in one cloud (say, Azure), it runs on another cloud vendor (say, AWS).

I imagine storage across the clouds will have to be kept in sync (maybe with Delta deep clone?). How does this work if managed tables are used?

Do you have any reference articles or documentation I can review? I would like to gain more insight into how this is designed and implemented.

1 ACCEPTED SOLUTION

liliana_tang
New Contributor III

Hi Wilson! Great question, and yes, you will be able to achieve that through Deep Clone: https://www.databricks.com/blog/2021/04/20/attack-of-the-delta-clones-against-disaster-recovery-avai... Please give it a read; it outlines the whole process in detail!

Additionally, this blog post does a great job of describing, at a high level, how we built a lakehouse across multiple clouds: https://www.databricks.com/blog/2021/07/14/petabyte-scale-data-processing-across-multiple-cloud-plat...

3 REPLIES

wilson-mok
New Contributor III

Thank you for the links!

stefnhuy
New Contributor III

Hey there, Wilson-Mok!

Multi-cloud implementation with Databricks is a captivating endeavor, isn't it? You're on the right track thinking about data synchronization.

One way to keep data in sync between clouds is to schedule replication, perhaps by combining Delta Lake with external tools like Apache Airflow or dbt (data build tool). Delta Lake's deep clone can indeed be useful here: you can run it periodically to replicate data from one cloud's storage to another.
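To make the periodic deep clone idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration under some assumptions: the job runs in the secondary-cloud workspace, that workspace already has credentials to read the primary cloud's storage, and the paths, table names, and the `CloneJob`, `deep_clone_sql`, and `sync_tables` helpers are all made up for this example. The only Databricks feature it relies on is the `CREATE OR REPLACE TABLE ... DEEP CLONE ... LOCATION ...` SQL statement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CloneJob:
    """One table to replicate. All values are hypothetical examples."""
    source_path: str      # Delta table path in the primary cloud
    target_table: str     # table name to create in the secondary workspace
    target_location: str  # storage path in the secondary cloud


def deep_clone_sql(job: CloneJob) -> str:
    """Build the deep clone statement for one table.

    Because the target has an explicit LOCATION, it is created as an
    external table in the secondary cloud's storage.
    """
    return (
        f"CREATE OR REPLACE TABLE {job.target_table} "
        f"DEEP CLONE delta.`{job.source_path}` "
        f"LOCATION '{job.target_location}'"
    )


def sync_tables(spark, jobs):
    """Run every clone statement through an existing SparkSession.

    `spark` is the session a Databricks notebook or job already provides;
    it is passed in so this module does not need to import pyspark.
    """
    for job in jobs:
        spark.sql(deep_clone_sql(job))


# Hypothetical job list: Azure is the primary cloud, AWS the secondary.
JOBS = [
    CloneJob(
        source_path="abfss://data@primaryacct.dfs.core.windows.net/sales/orders",
        target_table="dr.orders",
        target_location="s3://dr-bucket/sales/orders",
    ),
]
```

A scheduler (a Databricks job, or an Airflow task) would then call `sync_tables(spark, JOBS)` on whatever interval matches your recovery point objective. Re-running a deep clone against the same target copies only the changes since the previous run, which is what makes the periodic approach practical.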

However, when it comes to managed tables, things can get a bit tricky. You'll need to ensure that the metastore holding your managed tables' metadata is accessible from both clouds. Additionally, consider using a mechanism like the Hive metastore as a unified metadata repository. You can also find some useful ideas here: Cloud Data Migration Challenges: Explore 6 Best Strategies in 2023.

As for reference material, the official Databricks documentation often includes valuable insights and best practices for multi-cloud setups. You may also want to explore community forums and blogs to learn from real-world experiences.