Data Engineering
Anti-pattern: moving data from cloud to on-prem

Ruby8376
Valued Contributor

Hi there,

In my current project,

Current status: Azure Databricks streaming jobs move JSON data from Kafka into the raw layer (Parquet files); parsing logic is then applied and 8 tables are created in the raw standardized layer.

Requirement: The business team wants to access this data from an on-prem SQL Server, and hence proposes to follow an anti-pattern (moving data from cloud to on-prem), as they cannot access Databricks due to technical inability.

How can this be achieved? Using ADF?

1 ACCEPTED SOLUTION


-werners-
Esteemed Contributor III

You could indeed use ADF to copy the data from cloud to on-prem.

However, depending on the size of the data, this can take a while.

I use the same pattern, but for aggregated processed data, which is not an issue at all.

You could also look at Azure Synapse Serverless or Delta Sharing, or even SQL Hybrid.

But if the data MUST be on-prem, a copy seems the only way.

In that case I'd try to use PolyBase for the copy (I think recent versions of SQL Server can use this).

But I would strongly suggest not moving raw data to on-prem.

There are several possibilities to query cloud data using SQL (maybe not SQL Server itself, but Azure Synapse Serverless also uses T-SQL, for example).
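If the copy to on-prem does go ahead, the load into SQL Server is typically the slow part, so batched inserts matter. Below is a minimal stand-alone sketch of a batched load, using stdlib sqlite3 purely as a stand-in for the on-prem SQL Server target (in practice the copy would go through ADF, PolyBase, or a Spark JDBC write; the table and column names are hypothetical):

```python
import sqlite3

# Rows as they might come out of the parsed/standardized layer (illustrative data).
rows = [("e1", "order", 10.0), ("e2", "order", 25.5), ("e3", "refund", -5.0)]

# sqlite3 in-memory DB stands in for the on-prem SQL Server target.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT, event_type TEXT, amount REAL)")

# executemany sends the rows in one batch instead of row-by-row round trips,
# which is the same idea as tuning batch size on a JDBC/bulk load.
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 3
```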


9 REPLIES

Ruby8376
Valued Contributor

@Werner Stinckens​ can you help?


Ruby8376
Valued Contributor

Hey thank you so much @Werner Stinckens​ .

Yes, it is a must to move the data to on-prem. Can you please share links/guides for the same?

I was going through the link below; it looks like it is not going to be a direct copy from Delta Lake. I would need to use an interim Azure Storage instance (staged copy from Delta Lake)?

https://docs.microsoft.com/en-us/azure/data-factory/connector-azure-databricks-delta-lake?tabs=data-...

-werners-
Esteemed Contributor III

I think so. The staged copy is probably a Parquet version or something of the Delta Lake table (I do not copy Delta Lake to an on-prem RDBMS at the moment).

If the tables to be copied are not Delta Lake but Parquet, the staging is not necessary.

The staging does not take a lot of time though, certainly not compared to inserting into SQL Server!

Ruby8376
Valued Contributor

Currently, data is being read from JSON files and landed into the raw layer in Parquet format. After that, one table is created where the entire data is inserted (one common raw standardized table); then, using parsing logic, different tables are created.

What would be your suggestion to move this data to an on-prem RDBMS?
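The parsing step described above (one common raw table fanned out into per-type tables) can be illustrated stand-alone. This is a minimal sketch in plain Python with hypothetical field names; the real job would express the same shape with Spark DataFrames:

```python
import json
from collections import defaultdict

# Raw layer: every record lands in one common standardized table (illustrative payloads).
raw_records = [
    '{"type": "customer", "id": 1, "name": "Ann"}',
    '{"type": "order", "id": 100, "customer_id": 1, "total": 25.5}',
    '{"type": "customer", "id": 2, "name": "Bob"}',
]

# Parsing logic: route each record to its own per-type table.
tables = defaultdict(list)
for line in raw_records:
    rec = json.loads(line)
    tables[rec.pop("type")].append(rec)

print(sorted(tables))           # ['customer', 'order']
print(len(tables["customer"]))  # 2
```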

-werners-
Esteemed Contributor III

It depends on the use case. If your colleagues want to do ad hoc analysis on these parsed tables, then yes. But if they have a specific use case (or several), I'd prepare/transform/aggregate the data first and send that to sql server.

Ruby8376
Valued Contributor

Agree! My only concern is the parsing logic that will have to be applied while writing data to the on-prem SQL Server. Should we move this data to Azure SQL first?

-werners-
Esteemed Contributor III

I'd apply all logic in Databricks/Spark, as there you have the advantage of parallel processing. Write the prepared data to Azure SQL, so no transformations have to be done in the RDBMS.
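The aggregate-before-shipping idea above can be shown with a small stand-alone example. In Spark this would be a groupBy/agg over the parsed tables followed by a JDBC write; here it is plain Python with hypothetical columns, so only the small prepared result would cross to the on-prem database:

```python
from collections import defaultdict

# Parsed fact rows (illustrative): (customer_id, amount)
facts = [(1, 10.0), (2, 5.0), (1, 15.5), (2, 4.5)]

# Aggregate on the cluster side (Spark equivalent: df.groupBy("customer_id").sum("amount")),
# so the RDBMS receives prepared data and does no transformations itself.
totals = defaultdict(float)
for customer_id, amount in facts:
    totals[customer_id] += amount

prepared = sorted(totals.items())
print(prepared)  # [(1, 25.5), (2, 9.5)]
```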

Kaniz
Community Manager

Hi @Ruby Rubi​, we haven't heard from you since the last response from @Werner Stinckens​, and I was checking back to see if his suggestions helped you.

If you have found a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.
