
Using Databricks for the end-to-end flow rather than using ADF for extracting data?

Ruby8376
Valued Contributor

Currently, our company uses ADF + Databricks for all batch integration. Using ADF, data is first copied to ADLS Gen2 (from different sources such as on-prem servers, FTP file-sharing solutions, etc.), then reformatted to CSV and copied to Delta Lake, where all transformation and merging of data happens. Finally, the data is loaded into Salesforce using a custom Bulk API v2 connector. We plan to remove the additional step of first copying data to ADLS Gen2 and instead do the entire end-to-end ETL on Databricks. Could you please provide an architecture diagram or pros/cons for this upgrade? Feel free to ask questions.

Since it is all banking data, security might be a concern.
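
For concreteness, the Delta-side transform step today looks roughly like the following minimal PySpark sketch (the path, table name, and key column are placeholders, not our real configuration):

# Sketch of the current CSV-to-Delta merge step; all names are hypothetical.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Read the CSV files that ADF landed in ADLS Gen2
staged = (spark.read
          .option("header", "true")
          .csv("abfss://raw@storageacct.dfs.core.windows.net/staged/orders/"))

# Merge into the Delta table where transformations happen,
# keyed on a (hypothetical) primary-key column
target = DeltaTable.forName(spark, "bronze.orders")
(target.alias("t")
 .merge(staged.alias("s"), "t.order_id = s.order_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())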


6 REPLIES

Ruby8376
Valued Contributor

Another thing: most of the systems we connect to are on-prem servers, so we use a self-hosted integration runtime in ADF. How can this be done in Databricks?

shrikant_kulkar
New Contributor III

Hey Ruby,

Here are a few options for you:

1. Use Databricks Partner Connect (Fivetran, Qlik Replicate, etc.) to send data from on-prem directly into raw/bronze Delta tables. There is additional cost for this software.

2. Set up a connection between the Databricks workspace and the on-prem network using ExpressRoute. This allows an external connection and a foreign catalog to be created directly on your on-prem data sources (see the sketch after this list).

3. Use a custom technique to send data from on-prem data sources to Event Hubs / Kafka streams, then use Spark Structured Streaming / Delta Live Tables to ingest the data into your bronze zone.

4. Replicate your on-prem data sources to the cloud; Azure offers replication options for most database technologies. You can then use option #2 to connect to the databases directly from your Databricks environment.
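
For option #2, a minimal sketch of what that could look like with Lakehouse Federation once network connectivity is in place (run from a Databricks notebook; the host, secret scope, and all object names are hypothetical):

# Hypothetical Lakehouse Federation setup for an on-prem SQL Server
# reachable over ExpressRoute; spark is the notebook's SparkSession.
spark.sql("""
    CREATE CONNECTION IF NOT EXISTS onprem_sqlserver TYPE sqlserver
    OPTIONS (
      host 'sqlserver.corp.example.com',
      port '1433',
      user secret('etl_scope', 'sql_user'),
      password secret('etl_scope', 'sql_password')
    )
""")

# Expose an on-prem database in Unity Catalog as a foreign catalog
spark.sql("""
    CREATE FOREIGN CATALOG IF NOT EXISTS onprem_sales
    USING CONNECTION onprem_sqlserver
    OPTIONS (database 'sales')
""")

# Query on-prem tables directly from Databricks
spark.sql("SELECT * FROM onprem_sales.dbo.customers LIMIT 10").show()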

Each option comes with its own cost, security, restorability, and recovery constraints. We have the same architecture as yours: we use Auto Loader to ingest data into bronze tables and also run micro-batch upserts on the stream to update the raw tables based on primary keys. For every source table we have two tables, one append-only and one kept up to date with merge. This gives us good flexibility in terms of reprocessing, recovery, and running multiple tables in parallel in a loop using ADF orchestration. Adding new data sources and schemas is very easy, as it is all templatized.
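
As a rough sketch of that dual-table pattern (paths, table names, and the primary-key column are placeholders):

# One Auto Loader stream feeding an append-only table plus a merge
# (upsert) table per source; all names and paths are hypothetical.
from delta.tables import DeltaTable

source = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "csv")
          .option("cloudFiles.schemaLocation", "/checkpoints/orders/schema")
          .load("abfss://landing@acct.dfs.core.windows.net/orders/"))

# 1) Append-only table: full history, useful for reprocessing/recovery
(source.writeStream
 .option("checkpointLocation", "/checkpoints/orders/append")
 .toTable("bronze.orders_append"))

# 2) Merge table: micro-batch upsert on the primary key
def upsert(batch_df, batch_id):
    batch_df = batch_df.dropDuplicates(["order_id"])  # one row per key per batch
    tgt = DeltaTable.forName(spark, "bronze.orders")
    (tgt.alias("t")
     .merge(batch_df.alias("s"), "t.order_id = s.order_id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

(source.writeStream
 .foreachBatch(upsert)
 .option("checkpointLocation", "/checkpoints/orders/merge")
 .start())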


Ruby8376
Valued Contributor

@shrikant_kulkar Is there any benefit to doing the extract part in Databricks itself, unlike our current architecture where we first load to ADLS using ADF? I guess it is worth doing everything end to end in Databricks if there is better processing, lower latency, etc.

Ruby8376
Valued Contributor

@-werners- Is there any benefit to doing the extract part in Databricks itself, unlike our current architecture where we first load to ADLS using ADF? I guess it is worth doing everything end to end in Databricks if there is better processing, lower latency, etc.

-werners-
Esteemed Contributor III (Accepted Solution)

The benefit is that you only use a single system instead of two. But that's about it, IMO.
Data Factory is cheaper and works great for ingest purposes. It has a lot of connectors and is basically serverless.
If Databricks offered such a product (serverless, cheap ingest with lots of connectors), I wouldn't hesitate to kick ADF out.
But for now we keep using it.
