Oracle Data Warehouse Replacement with Databricks Using Delta Lake

arijit_sani
New Contributor II

I am new to Spark and Databricks, and I am exploring both to understand whether we can replace our Oracle data warehouse with Databricks (Delta Lake) and use Spark to improve the ELT/ETL performance of the existing DW.

I have already done some reading in the Databricks blogs and the Spark documentation.

Below is the as-is functionality at a very high level:

1) The data warehouse batch runs nightly. It uses data from 5 different OLTP systems.

2) A CDC process runs to move the delta from each OLTP system to the DW Unix server.

3) When CDC completes for all the OLTP systems, a Talend ETL job runs (heavy on Oracle PL/SQL code in the tDB components) to do the transformation, and finally it stores the data into the DW tables.

This ETL takes a really long time, since a heavy transactional load comes from all the OLTP systems, and the aggregation phases run for 7 hours on some occasions (month end, due to interest accrual transactions).

Now, my questions are as follows:

1) To replace the CDC process, if I use a JDBC/Apache Spark connection, I would need to schedule it automatically each night (assuming I still want to run a nightly batch). We already have trigger files that denote process completion in each OLTP database. So, based on those trigger files or by any other method, how can we schedule the connection between the OLTP systems and Databricks to start the sync? (This could be a silly question, but it would be good to know.) A rough sketch of what I have in mind follows.
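To make this concrete, below is a minimal sketch of the nightly ingest I am imagining. It assumes it runs in a Databricks notebook (so `spark` and `dbutils` exist), that each OLTP system drops its trigger file on mounted cloud storage, and that the Oracle JDBC driver is installed on the cluster; every path, schema, and secret scope here is hypothetical:

```python
# Minimal sketch of the nightly ingest (hypothetical paths, names, secrets).
# Assumes a Databricks notebook, so `spark` and `dbutils` are already defined.

OLTP_SYSTEMS = ["oltp1", "oltp2", "oltp3", "oltp4", "oltp5"]

def trigger_present(system):
    """Check whether this OLTP system's CDC-complete trigger file has landed."""
    try:
        dbutils.fs.ls(f"/mnt/landing/triggers/{system}.done")  # hypothetical path
        return True
    except Exception:
        return False

if all(trigger_present(s) for s in OLTP_SYSTEMS):
    for system in OLTP_SYSTEMS:
        # Pull last night's delta over JDBC (query and columns are illustrative).
        df = (spark.read.format("jdbc")
              .option("url", "jdbc:oracle:thin:@//oltp-host:1521/ORCLPDB")  # hypothetical
              .option("query",
                      f"SELECT * FROM {system}_schema.transactions "
                      f"WHERE updated_at >= TRUNC(SYSDATE) - 1")
              .option("user", dbutils.secrets.get("etl", f"{system}-user"))
              .option("password", dbutils.secrets.get("etl", f"{system}-pwd"))
              .load())
        # Land the raw delta in a Bronze Delta table for the transform step.
        (df.write.format("delta")
           .mode("append")
           .saveAsTable(f"bronze.{system}_transactions"))
```

The part I am unsure about is how best to have this fire automatically once all five trigger files exist, e.g. polling from a scheduled Databricks Job versus something event-driven.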

2) I would definitely replace the Talend ETL with Spark once the data is read in step 1 (that is the main goal of re-architecting the tech stack 🙂). Now, for optimal performance, assuming at most 10-15 million transactions need to be processed and then aggregated, what kind of standard practice should I follow that would be both cost effective and performant? A rough sketch of the aggregation is below.
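For reference, the kind of aggregation I would be rewriting in Spark looks roughly like this; the table and column names are illustrative, not our real schema:

```python
# Hypothetical shape of the aggregation phase in Spark, standing in for the
# Talend/PL/SQL logic. Table and column names are illustrative only.
from pyspark.sql import functions as F

txns = spark.table("bronze.oltp1_transactions")

monthly_interest = (
    txns
    .where(F.col("txn_type") == "INTEREST_ACCRUAL")
    .groupBy("account_id", F.date_trunc("month", "txn_date").alias("month"))
    .agg(
        F.sum("amount").alias("total_accrued"),
        F.count("*").alias("txn_count"),
    )
)

monthly_interest.write.format("delta").mode("overwrite").saveAsTable("gold.monthly_interest")
```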

3) When I am saving the data to Delta Lake, which will eventually serve as the data warehouse, what are the things I need to keep in mind?
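For example, I assume each night's load into the warehouse tables would be a Delta MERGE along these lines (the table names and the txn_id key are hypothetical):

```python
# Sketch of upserting a nightly delta into a Delta Lake warehouse table.
# Table names and the join key (txn_id) are hypothetical.
from delta.tables import DeltaTable

updates = spark.table("bronze.oltp1_transactions")
target = DeltaTable.forName(spark, "dw.transactions")

(target.alias("t")
 .merge(updates.alias("s"), "t.txn_id = s.txn_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())

# Periodic maintenance Databricks supports for Delta tables (compaction + data skipping):
spark.sql("OPTIMIZE dw.transactions ZORDER BY (account_id)")
```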

Please let me know your inputs.

2 REPLIES

Thanks a lot for the insight, much appreciated.

I am keen on understanding the first phase, that is, reading the data from the OLTP systems into Databricks on a nightly schedule.

Could you provide some additional reading/blog links, or a bit more technical detail, on replacing the existing CDC with Databricks connectors to Oracle (or any RDBMS) and on how to schedule the sync?

Thanks in advance

arijit_sani
New Contributor II

Hi, could you please advise on this?
