11-24-2022 08:18 AM
Hi Liliana,
During the user group discussion, there was a mention of multi-cloud implementation with Databricks: if a workload fails in one cloud (say, Azure), it will run on another cloud vendor (say, AWS).
I imagine the storage across the clouds would have to be kept in sync (maybe with Delta deep clone?), and how does this work if managed tables are used?
Do you have any reference articles or documentation I can review? I would like to gain more insight into how this is designed and implemented.
12-01-2022 01:30 PM
Hi Wilson! Great question, and yes, you can achieve that with Deep Clone: https://www.databricks.com/blog/2021/04/20/attack-of-the-delta-clones-against-disaster-recovery-avai... Please give it a read; it outlines the whole process in detail!
Additionally, this blog post does a great job of describing, at a high level, how we built a lakehouse across multiple clouds: https://www.databricks.com/blog/2021/07/14/petabyte-scale-data-processing-across-multiple-cloud-plat...
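To make the idea concrete, here is a minimal sketch of the kind of incremental sync the Deep Clone post describes. The table name, storage path, and helper function below are all hypothetical; in a real job you would pass the generated statement to `spark.sql(...)` on a schedule, with cross-cloud storage credentials already configured.

```python
def deep_clone_sql(source_table: str, target_path: str) -> str:
    """Build a DEEP CLONE statement for a disaster-recovery sync.

    DEEP CLONE copies both the data files and the table metadata to the
    target location. Re-running the same clone is incremental: only files
    added or changed since the previous clone are copied.
    """
    return (
        f"CREATE OR REPLACE TABLE delta.`{target_path}` "
        f"DEEP CLONE {source_table}"
    )

# Hypothetical names; in Databricks this would be spark.sql(stmt)
stmt = deep_clone_sql("prod.sales_orders", "s3://dr-bucket/sales_orders")
print(stmt)
```

Running this on a regular cadence keeps the replica in the secondary cloud close to the primary; the recovery point is bounded by the clone interval.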
12-12-2022 09:32 AM
Thank you for the links!
09-11-2023 04:42 AM
Hey there, Wilson-Mok!
Multi-cloud implementation with Databricks is an interesting challenge, isn't it? You're on the right track thinking about data synchronization.
One way to keep data consistent between clouds is to use a replication workflow, perhaps combining Delta Lake with an orchestration tool like Apache Airflow or dbt (data build tool). Delta Lake's deep clone can indeed be useful here: you can periodically replicate data from one cloud's storage to another with it.
However, when it comes to managed tables, things get trickier. You'll need to ensure that the metastore holding your managed tables' metadata is accessible from both clouds; a mechanism like an external Hive metastore can serve as a unified metadata repository. You can also find some useful thoughts here: Cloud Data Migration Challenges: Explore 6 Best Strategies in 2023.
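One common way around the managed-table problem is to register the cloned Delta files as an external (unmanaged) table in the secondary cloud's metastore, so failed-over workloads can query the replica by name rather than by path. A minimal sketch, with a hypothetical helper, table name, and path:

```python
def register_replica_sql(table_name: str, location: str) -> str:
    """Build a statement that registers an existing Delta location as an
    external table in the metastore. Because the table is unmanaged, the
    metastore entry only points at the files; dropping it later would not
    delete the replicated data.
    """
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        f"USING DELTA LOCATION '{location}'"
    )

# Hypothetical names; in Databricks this would be spark.sql(stmt)
stmt = register_replica_sql("dr.sales_orders", "s3://dr-bucket/sales_orders")
print(stmt)
```

Run once per replicated table in the secondary workspace; after that, the periodic deep clones refresh the data underneath the registered name.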
As for reference material, the official Databricks documentation includes valuable guidance and best practices for multi-cloud setups. Community forums and blogs are also good places to learn from real-world experiences.