
Using Databricks for the end-to-end flow, rather than using ADF for extracting data?

Ruby8376
Valued Contributor

Currently, in our company we use ADF + Databricks for all batch integration. Using ADF, data is first copied to ADLS Gen2 (from different sources such as on-prem servers, FTP file-sharing solutions, etc.), then it is reformatted to CSV and copied to the Delta lake, where all transformation and merging of data happens. Data is loaded into Salesforce using a custom Bulk API v2 connector. We plan to remove the additional step of first copying data to ADLS Gen2 and instead do the end-to-end ETL on Databricks. Could you please provide an architecture diagram or pros/cons for this upgrade? Feel free to ask questions.

Since it is all banking data, security might be a concern.
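For illustration, a minimal sketch of the CSV-to-Delta merge step described above, assuming a Databricks notebook (where `spark` is predefined) and hypothetical paths, table names, and key column:

```python
# Sketch of the CSV -> Delta merge step in the current flow.
# Paths, table names, and the key column are hypothetical placeholders;
# the target table is assumed to already exist.
from delta.tables import DeltaTable

raw_df = (spark.read
    .option("header", "true")
    .csv("abfss://landing@<storage-account>.dfs.core.windows.net/daily/"))

target = DeltaTable.forName(spark, "silver.customers")

(target.alias("t")
    .merge(raw_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```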


6 REPLIES

Ruby8376
Valued Contributor

Also, another thing: most of the systems we connect to are on-prem servers, so we use a self-hosted integration runtime in ADF. How can this be done in Databricks?

shrikant_kulkar
New Contributor II

Hey Ruby, 

Here are a few options for you:

1. Use Databricks Partner Connect (Fivetran, Qlik Replicate, etc.) to send data from on-prem directly into raw/bronze Delta tables. There is additional cost for these tools.

2. Set up a connection between the Databricks workspace and your on-prem network using ExpressRoute. This will allow you to create an external connection and a foreign catalog directly on your on-prem data sources (see the federation sketch after this list).

3. Use a custom technique to send data from on-prem data sources to Event Hubs / Kafka streams, then use Spark Structured Streaming / Delta Live Tables to ingest the data into your bronze zone (see the streaming sketch after this list).

4. Replicate your on-prem data sources to the cloud. Azure offers replication options to the cloud for most database technologies. You can then use option #2 to connect to the databases directly from your Databricks environment.
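For illustration, option 2 could look roughly like the following Lakehouse Federation sketch, run from a Databricks notebook. The host, secret scope, and catalog/database names are hypothetical, and the private network route (ExpressRoute/VPN plus DNS) must already be in place:

```python
# Sketch of option 2: a Unity Catalog connection plus a foreign catalog
# over a pre-existing private network path. All names are placeholders.
spark.sql("""
  CREATE CONNECTION IF NOT EXISTS onprem_sqlserver TYPE sqlserver
  OPTIONS (
    host 'sql01.corp.example.com',
    port '1433',
    user secret('my_scope', 'sql_user'),
    password secret('my_scope', 'sql_pwd')
  )
""")

spark.sql("""
  CREATE FOREIGN CATALOG IF NOT EXISTS onprem_sales
  USING CONNECTION onprem_sqlserver
  OPTIONS (database 'sales')
""")

# Query the on-prem table directly, without landing it in ADLS first.
df = spark.table("onprem_sales.dbo.orders")
```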
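And a minimal sketch of option 3, reading from a Kafka-compatible endpoint into a bronze Delta table. The broker, topic, and paths are hypothetical, and the extra SASL options an Event Hubs Kafka endpoint would need are omitted:

```python
# Sketch of option 3: Structured Streaming from Kafka into bronze Delta.
# Broker, topic, table, and checkpoint locations are placeholders.
(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1.example.com:9093")
    .option("subscribe", "onprem-cdc")
    .option("startingOffsets", "earliest")
    .load()
    .selectExpr("CAST(key AS STRING) AS key",
                "CAST(value AS STRING) AS payload",
                "timestamp")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/onprem_cdc/bronze")
    .toTable("bronze.onprem_cdc_raw"))
```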

Each option comes with its own cost, security, restorability, and recovery constraints. We have the same architecture as yours: we use Auto Loader to ingest data into bronze tables and also run micro-batch upserts on the stream to update raw tables based on primary keys. For every source table we have two tables, one append-only and another with merge. This gives us good flexibility in terms of reprocessing, recovery, and running multiple tables in parallel in a loop using ADF orchestration. Adding new data sources and schemas is very easy since it is templatized; a sketch of the pattern follows below.
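For illustration, the Auto Loader plus micro-batch upsert pattern described above could be sketched like this. Paths, table names, and the key column are hypothetical placeholders, and the append-only twin would simply be a second stream written in append mode:

```python
# Sketch of Auto Loader ingestion with per-batch MERGE into a raw table.
# Assumes the target Delta table already exists; all names are placeholders.
from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    tgt = DeltaTable.forName(spark, "bronze.orders_merged")
    (tgt.alias("t")
        .merge(batch_df.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/checkpoints/orders/schema")
    .load("abfss://landing@<storage-account>.dfs.core.windows.net/orders/")
    .writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/checkpoints/orders/merge")
    .trigger(availableNow=True)
    .start())
```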


Ruby8376
Valued Contributor

@shrikant_kulkar Is there any benefit to doing the extract part in Databricks itself, unlike our current architecture where we first load to ADLS using ADF? I guess it is worth doing everything end to end in Databricks if there is better processing, lower latency, etc.

Ruby8376
Valued Contributor

@-werners- Is there any benefit to doing the extract part in Databricks itself, unlike our current architecture where we first load to ADLS using ADF? I guess it is worth doing everything end to end in Databricks if there is better processing, lower latency, etc.

-werners-
Esteemed Contributor III

The benefit is that you use a single system instead of two, but that's about it, IMO.
Data Factory is cheaper and works great for ingestion purposes. It has a lot of connectors and is basically serverless.
If Databricks offered such a product (serverless, cheap ingestion with lots of connectors), I wouldn't hesitate to kick ADF out.
But for now we keep using it.
