Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

High Level Design for Transferring Data from One Databricks Account to Another Databricks Account

Datalight
New Contributor III

Hi,

Could someone please help me with just the points that should be part of the High Level Design and Low Level Design when transferring data from one Databricks account to another Databricks account using Unity Catalog? First a full data transfer, and then only incremental loads.

Please help me with the points that should be part of the HLD and then the LLD.

Thanks a lot.


10 REPLIES

szymon_dybczak
Esteemed Contributor III

Hi @Datalight ,

In this scenario (account to account, and I'm assuming they are in different metastores) it's recommended to use Delta Sharing. Key features of Delta Sharing:

  1. Open Protocol: Allows sharing data across different platforms, including Databricks, Snowflake, Apache Spark, and pandas (see the small sketch after this list).
  2. Real-time Data Access: Consumers always access the latest data without needing ETL pipelines or data duplication.
  3. Fine-Grained Access Control: With Unity Catalog, you can manage permissions at the catalog, schema, table, or even row level.
  4. Cross-Cloud Sharing: You can share data across different cloud providers (Azure, AWS, and GCP) or different Databricks accounts.
  5. No Data Replication: Consumers query the shared Delta table directly from its storage location.
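To make point 1 concrete, here is a minimal, hedged sketch of the open-protocol path: reading a shared table into pandas with the delta-sharing Python client. The profile file path and the share/schema/table names are made-up placeholders; for Databricks-to-Databricks sharing you don't need this client at all, because the share appears as a catalog (see further below).

```python
# Sketch only (assumed names): reading a shared table over the open
# Delta Sharing protocol with the `delta-sharing` pip package.
import delta_sharing

# Credential/profile file downloaded from the data provider (placeholder path).
profile_file = "/dbfs/tmp/config.share"

# Table coordinates follow the pattern <profile>#<share>.<schema>.<table>.
table_url = f"{profile_file}#sales_share.sales.orders"

# Load the current snapshot of the shared table into a pandas DataFrame.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```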

You can read about it here:

What is Delta Sharing? - Azure Databricks | Microsoft Learn

You can read about how to setup delta sharing at below link:

Set up Delta Sharing for your account (for providers) - Azure Databricks | Microsoft Learn

Regarding incremental loading, Delta Sharing supports sharing the Change Data Feed for Delta tables. This is an excellent way for data recipients to keep track of incremental changes as they are made by the data provider. Data recipients can read only the changes that have been made to a table, rather than having to re-read the entire dataset to get the latest snapshot.

You can read about it in the article below:

Use Delta Lake change data feed on Azure Databricks - Azure Databricks | Microsoft Learn
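As a rough illustration of that incremental pattern, here is a hedged sketch run from a Databricks notebook (where `spark` is the pre-created SparkSession). The table names and starting version are placeholders, and it assumes Change Data Feed is enabled on the source table and the table is shared with history.

```python
# Provider side (Account A): enable Change Data Feed on the table you share.
spark.sql("""
    ALTER TABLE prod_catalog.sales.orders
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Recipient side (Account B): read only the rows that changed since a known
# version, instead of re-reading the whole table.
changes = spark.sql("""
    SELECT *
    FROM table_changes('shared_catalog.sales.orders', 10)
""")
changes.show()
```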

Regarding the high-level design, it would involve:

On Account A (Provider)

- Create a share and add the tables you want to share

- Create a recipient (if Databricks-to-Databricks): either create the recipient object yourself, or let the recipient request access and approve it. You can scope it to a particular workspace or an external identity. See the docs for the exact steps.
- Grant the recipient SELECT on the share (or accept their request). The recipient will receive access to the live table metadata and data through Delta Sharing (a minimal SQL sketch of these provider-side steps follows this list).
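A minimal sketch of those provider-side steps as SQL run from a notebook in Account A; the share, table, and recipient names and the sharing identifier are placeholders, so check the linked docs for the exact syntax in your setup.

```python
# Provider side (Account A), run in a Databricks notebook (`spark` is the
# notebook's SparkSession). All object names below are assumptions.

# 1. Create a share and add the table(s) you want to expose.
spark.sql("CREATE SHARE IF NOT EXISTS sales_share")
spark.sql("ALTER SHARE sales_share ADD TABLE prod_catalog.sales.orders")

# 2. Create a recipient; for Databricks-to-Databricks sharing it is identified
#    by the other metastore's sharing identifier (visible to the recipient in
#    Catalog Explorer).
spark.sql("""
    CREATE RECIPIENT IF NOT EXISTS account_b
    USING ID 'azure:westeurope:<recipient-metastore-uuid>'
""")

# 3. Grant the recipient read access to the share.
spark.sql("GRANT SELECT ON SHARE sales_share TO RECIPIENT account_b")
```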

On Account B (Recipient)

- Connect to the provider share (Catalog Explorer → Delta Sharing → Add provider, or accept the provider invite). This mounts the provider's share as a read-only catalog, and you can query the shared data like any other table (see the sketch below).
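And a matching recipient-side sketch for Account B; the provider name (as it appears under Delta Sharing in Catalog Explorer) and the catalog/table names are placeholders.

```python
# Recipient side (Account B), run in a Databricks notebook.

# 1. Mount the provider's share as a read-only catalog in your metastore.
spark.sql("""
    CREATE CATALOG IF NOT EXISTS shared_catalog
    USING SHARE account_a_provider.sales_share
""")

# 2. Query the shared data like any other Unity Catalog table.
spark.sql("SELECT COUNT(*) AS row_count FROM shared_catalog.sales.orders").show()
```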

For the low-level design, just refer to the documentation; there's no better source:

Share data using the Delta Sharing Databricks-to-Databricks protocol (for providers) - Azure Databri...

Create and manage data recipients for Delta Sharing (Databricks-to-Databricks sharing) - Azure Datab...

Read data shared using Databricks-to-Databricks Delta Sharing (for recipients) - Azure Databricks | ...

Reference: Solved: Re: Data Transfer using Unity Catalog full impleme... - Databricks Community - 128218

@szymon_dybczak : Thanks a lot. Could you please help me with how I can persist data on Account B (the recipient)?

I have to push the data to the recipient's ADLS Gen2.

Please share your thoughts.

Datalight
New Contributor III

Do I need to execute the insert queries through an orchestration tool, either ADF or a Databricks workflow?

Kindly share your thoughts.

szymon_dybczak
Esteemed Contributor III

The whole idea of Delta Sharing is that you write your data into your own account and create a share. The recipient (B) can then read from that share.
Maybe try watching the videos below to grasp the general idea:

Delta Sharing in Action: Architecture and Best Practices - YouTube

Databricks Delta Sharing Demo

39 Delta Sharing in Databricks | Databricks Express Edition | Share Data outside your Organiza...

@szymon_dybczak : Thanks. Is there any way to do a write operation on the recipient's local storage? Pardon me if it sounds illogical.

Many Thanks

szymon_dybczak
Esteemed Contributor III

If I understood your question correctly, you're asking whether the recipient of a share can write/update data on the share? If so, unfortunately this is not possible.

So, you have access to all the data that the provider (let's say account A) has shared with you. Any update that provider A makes to that data will be available to the recipient (B) in near real time. But the recipient cannot modify the shared data itself.
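If the goal (as asked above) is to land a copy in the recipient's own ADLS Gen2, one common pattern, sketched below with placeholder names, is to copy from the read-only shared catalog into a table owned by Account B's Unity Catalog, whose managed or external location sits in B's ADLS Gen2. The incremental MERGE assumes Change Data Feed is available on the shared table; column names like order_id are made up.

```python
# Recipient side (Account B): the share itself is read-only, but you can
# materialize it into your own catalog/storage. All names are placeholders.

# One-off full copy into a table owned by Account B.
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_catalog.bronze.orders
    AS SELECT * FROM shared_catalog.sales.orders
""")

# Incremental upserts afterwards (e.g. in a scheduled job), driven by the
# change data feed of the shared table.
spark.sql("""
    MERGE INTO my_catalog.bronze.orders AS t
    USING (
        SELECT * EXCEPT (_change_type, _commit_version, _commit_timestamp)
        FROM table_changes('shared_catalog.sales.orders', 10)
        WHERE _change_type IN ('insert', 'update_postimage')
    ) AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```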

 

[screenshot: szymon_dybczak_0-1756111838318.png]

@szymon_dybczak : Hi,

[screenshot: Datalight_0-1756113582946.png]

Here A and B are two different Databricks accounts.

Whenever new data arrives in A (ADLS Gen2), it should automatically be pushed to B (ADLS Gen2).

Both A and B are UC enabled.

Could you please help me with the detailed steps for how I can achieve this?

Do I need to orchestrate the pipeline with a Databricks workflow or ADF?

Kindly share your thoughts. 

szymon_dybczak
Esteemed Contributor III

Again, if your data is written to account A, the changes should be visible nearly in real time in account B. This discussion won't make any sense if you don't dedicate one hour of your life to reading about the fundamental concepts. So, please watch the YT videos I've provided or read the documentation.

Take a look at "delta sharing" and "deep cloning". I implemented a similar solution for Disaster Recovery between regions using those features, in my case under the same account, but that could be helpful in your case as well. Take into account that DEEP CLONE works incrementally. KR.

https://www.youtube.com/@CafeConData

Coffee77
New Contributor III

Following up on my previous reply: you can use DEEP CLONE to clone data incrementally between workspaces by including it in a scheduled job, but note that this will not work in real time.
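A hedged sketch of that pattern: a notebook cell that a scheduled Databricks job re-runs, using DEEP CLONE so only new or changed files are copied on each run. The catalog and table names are placeholders, and whether the clone source can be a Delta Sharing table depends on your setup (Coffee77's DR case was within a single account).

```python
# Scheduled job (e.g. daily): incrementally sync a target table from a source
# table with DEEP CLONE. Re-running the same statement copies only the files
# that changed since the previous run. Names are placeholders.
spark.sql("""
    CREATE OR REPLACE TABLE target_catalog.bronze.orders
    DEEP CLONE source_catalog.bronze.orders
""")
```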

https://www.youtube.com/@CafeConData
