I am new to Spark and Databricks and am exploring them to understand whether we can replace our Oracle data warehouse with Databricks (Delta Lake) and use Spark to improve the ELT/ETL performance of the existing DW.
So far I have looked through some Databricks blogs and the Spark documentation.
Below is the as-is functionality at a very high level:
1) The data warehouse batch runs nightly. It uses data from 5 different OLTP systems.
2) A CDC process runs to move the delta from each OLTP system to the DW Unix server.
3) When CDC has completed for all the OLTP systems, a Talend ETL runs (heavy on Oracle PL/SQL code in the tDB components) to do the transformations and finally load the data into the DW tables.
This ETL takes a really long time because of the heavy transactional load coming from all the OLTP systems; on some occasions (month end, due to interest accrual transactions) the aggregation phases run for 7 hours.
Now my questions are as follows:
1) To replace the CDC process, if I use a JDBC/Apache Spark connection, I would need to schedule it automatically each night (assuming I still want to run a nightly batch). We already have trigger files that denote process completion in each OLTP database. So, based on those trigger files or by any other method, how can we schedule the connection between the OLTP systems and Databricks to start the sync? (Could be a silly question, but it would be good to know.) Sketch 1 after the questions shows roughly what I have in mind.
2) I would definitely replace the Talend ETL with Spark (that is the main goal of re-architecting the tech stack 🙂) once the data has been read in step 1. Now, for optimal performance, assuming at most 10-15 million transactions need to be processed and then aggregated, what standard practices should I follow that would be both cost effective and performant? Sketch 2 below is the kind of job I am picturing.
3) When I save the data to Delta Lake, which will eventually serve as the data warehouse, what do I need to keep in mind? Sketch 3 below shows how I currently picture the writes.
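Sketch 1 - what I currently imagine for the trigger-file driven JDBC pull, as a nightly Databricks notebook. This is only a rough sketch: the trigger path, schema/table names, secret scope, and partition bounds are all assumptions on my part.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# All of these values are placeholders for the sketch
TRIGGER_PATH = "/mnt/landing/triggers/"            # where each OLTP system drops its "done" file
OLTP_SYSTEMS = ["oltp1", "oltp2", "oltp3", "oltp4", "oltp5"]
JDBC_URL = "jdbc:oracle:thin:@//oltp-host:1521/ORCLPDB"

def all_triggers_present():
    """True only when every OLTP system has written its trigger file."""
    files = {f.name for f in dbutils.fs.ls(TRIGGER_PATH)}   # dbutils is available in Databricks notebooks
    return all(f"{system}.done" in files for system in OLTP_SYSTEMS)

if all_triggers_present():
    for system in OLTP_SYSTEMS:
        # Read only the nightly delta; the txn_delta view and txn_id key are assumptions
        df = (spark.read.format("jdbc")
              .option("url", JDBC_URL)
              .option("dbtable", f"{system}.txn_delta")
              .option("user", dbutils.secrets.get("oltp-scope", f"{system}-user"))
              .option("password", dbutils.secrets.get("oltp-scope", f"{system}-pwd"))
              .option("fetchsize", 10000)
              .option("numPartitions", 8)            # parallel JDBC reads
              .option("partitionColumn", "txn_id")
              .option("lowerBound", 0)
              .option("upperBound", 100000000)
              .load())
        # Land the raw delta in a bronze Delta table
        df.write.format("delta").mode("append").saveAsTable(f"bronze.{system}_txn")
else:
    dbutils.notebook.exit("Trigger files not ready yet")
```

My idea would be to schedule this notebook as a nightly Databricks Job, or to have the job start when the trigger files arrive instead of polling like this; is that the right direction?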
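Sketch 2 - the rough shape of the Spark job I am picturing for the transformation/aggregation step. The table and column names are invented; my understanding is that 10-15 million rows is fairly small for Spark, and that broadcasting small dimension tables and letting the engine handle shuffle partitioning is the usual advice, but please correct me.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Staged nightly transactions written in step 1 (names are made up)
txns = spark.table("bronze.oltp1_txn")           # ~10-15 million rows per night
accounts = spark.table("silver.dim_account")     # small dimension table

# Broadcast the small dimension so the join avoids shuffling the large side
enriched = txns.join(F.broadcast(accounts), "account_id", "left")

# Example roll-up, in the spirit of the interest accrual aggregations
daily_agg = (enriched
             .groupBy("account_id", "business_date")
             .agg(F.sum("amount").alias("total_amount"),
                  F.count("*").alias("txn_count")))

daily_agg.write.format("delta").mode("overwrite").saveAsTable("gold.daily_account_agg")
```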
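Sketch 3 - how I currently picture the writes into Delta Lake, i.e. a MERGE (upsert) into the warehouse tables plus the maintenance commands I have read about (OPTIMIZE/ZORDER and VACUUM). Again, the table names and the merge keys are assumptions, not our real model.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Nightly aggregate from sketch 2, and a hypothetical warehouse table to upsert into
updates = spark.table("gold.daily_account_agg")
target = DeltaTable.forName(spark, "gold.daily_account_agg_hist")

(target.alias("t")
       .merge(updates.alias("s"),
              "t.account_id = s.account_id AND t.business_date = s.business_date")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

# Maintenance I understand is recommended for Delta tables used as a warehouse
spark.sql("OPTIMIZE gold.daily_account_agg_hist ZORDER BY (account_id)")
spark.sql("VACUUM gold.daily_account_agg_hist RETAIN 168 HOURS")
```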
Please let me know your inputs.