
Oracle Data Warehouse Replacement With Databricks Using Delta Lake

arijit_sani
New Contributor II

I am new to Spark and Databricks, and I am exploring them to understand how to replace our Oracle data warehouse with Databricks (Delta Lake) and to use Spark to improve the ELT/ETL performance of the existing DW.

I have done some reading in the Databricks blogs and the Spark documentation.

Below is the as-is functionality at a very high level:

1) The data warehouse batch runs nightly. It uses data from 5 different OLTP systems.

2) There is a CDC process that moves the delta from each OLTP system's data to the DW Unix server.

3) When CDC completes for all OLTP systems, the Talend ETL runs (which is heavy on Oracle PL/SQL code in the tDB component) to do the transformation and finally stores the data into the DW tables.

This ETL takes a really long time because of the heavy transactional load coming from all the OLTP systems, and the aggregation phases run for 7 hours on some occasions (at month end, due to interest accrual transactions).

Now my questions are as follows:

1) To replace the CDC process, if I use a JDBC connection / Apache Spark connector, I would need to schedule it automatically each night (assuming I would still want to run a nightly batch). We already have trigger files that denote process completion in each OLTP database. Based on those trigger files, or by any other method, how can we schedule the connection between the OLTP systems and Databricks to start the sync? (Could be a silly question, but it would be good to know.)

2) I would definitely replace the Talend ETL with Spark (that is the main goal of re-architecting the tech stack 🙂) once the data is read in step 1. For optimal performance, assuming at most 10-15 million transactions need to be processed and then aggregated, what standard practices should I follow that would be both cost effective and performant?

3) When I am saving the data to Delta Lake, which will eventually serve as the data warehouse, what do I need to keep in mind?

Please let me know your inputs.

3 REPLIES

Kaniz_Fatma
Community Manager

Hi @arijit_sani, let's go step by step.

Scheduling Data Ingestion from OLTP Systems

  1. Use Databricks Ingest or Partner Tools: Imagine Databricks Ingest and its buddies, like Fivetran, Qlik Replicate, and Arcion, as data ninjas. They sneak into your OLTP databases and stealthily pull data into Databricks, all based on the secret signals (triggers) or schedules you set. 🥷📅
  2. Leverage Databricks Autoloader: Think of Autoloader as your trusty mail sorter. It processes new files arriving in cloud storage from your OLTP systems incrementally, making sure everything is organized and up-to-date without you lifting a finger (see the sketch after this list). 📬🗂
  3. Schedule Databricks Jobs: Picture Databricks jobs as your reliable night shift workers. They run nightly or on triggers to manage and transform your data pipelines, ensuring everything is ready for the next day’s operations. 🌙🛠
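
As a rough illustration of points 2 and 3, here is a minimal PySpark Auto Loader sketch. The storage paths, file format, and table name are placeholders I've assumed, not details from your environment:

```python
# Minimal Auto Loader sketch: incrementally ingest the CDC files that each
# OLTP extract drops into cloud storage. All paths/table names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_path = "s3://my-bucket/oltp_system_1/cdc/"         # hypothetical landing zone
checkpoint_path = "s3://my-bucket/_checkpoints/oltp1"  # tracks processed files + inferred schema

(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")             # format of the incoming CDC files
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load(raw_path)
    .writeStream.format("delta")
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)                         # process everything new, then stop
    .toTable("bronze.oltp1_transactions")               # hypothetical bronze table
)
```

The nightly kickoff itself can be a Databricks Job on a cron schedule, or a file-arrival trigger that fires once your existing OLTP trigger files land in cloud storage.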

Optimizing Spark ETL Performance

1. Partition your Delta tables: Think of partitioning your Delta tables like organizing your wardrobe by season. By dividing your data by relevant columns, you can quickly find what you need without sifting through everything, leading to faster query performance (a sketch follows this list). 🧥🩳

2. Use Z-Ordering: Imagine Z-Ordering as arranging books on a shelf by genre. It groups related data together, making it easier and faster to perform aggregations and filters. 📚🔎

3. Optimize joins: Optimizing joins is like seating small kids at the front of the classroom. By broadcasting small tables and using bucketing/sorting, you ensure everyone gets the best view and can quickly find their place. 🧒📚

4. Leverage Databricks Photon: Using Databricks Photon is like upgrading from a bicycle to a rocket ship. It supercharges your Spark SQL queries, making them up to 10x faster. 🚀💨

5. Compact small files: Compaction is like compressing a messy stack of papers into a neat binder. Using OPTIMIZE commands, you consolidate small files into larger ones, improving read performance. 📂📈
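
To make points 1, 2, 3, and 5 concrete, here is a hedged PySpark sketch; the table and column names (fact_transactions, dim_account, txn_date, account_id) are illustrative assumptions, not details from this thread:

```python
# Sketch only: table/column names are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# 1) Partition the Delta table on a column that the queries filter on.
(
    spark.table("bronze.oltp1_transactions")
    .write.format("delta")
    .partitionBy("txn_date")
    .mode("overwrite")
    .saveAsTable("silver.fact_transactions")
)

# 2) and 5) Compact small files and cluster the data by a frequent join/filter key.
spark.sql("OPTIMIZE silver.fact_transactions ZORDER BY (account_id)")

# 3) Broadcast the small dimension table so the join avoids a full shuffle.
fact = spark.table("silver.fact_transactions")
dim = spark.table("silver.dim_account")      # hypothetical small dimension table
joined = fact.join(broadcast(dim), "account_id")
```

Partitioning plus Z-ordering mainly helps the month-end aggregations read less data, which is where the 7-hour runs come from.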

Best Practices for Delta Lake

1. Use Delta Lake MERGE commands: Think of MERGE commands as a super-efficient janitor that not only sweeps up old data but also inserts new data in one go. This keeps your data tidy and up-to-date without breaking a sweat (see the sketch after this list). 🧹

2. Leverage Delta Lake time travel: Imagine having a time machine for your data, allowing you to view past versions whenever you want. This is perfect for checking out old reports or undoing mistakes. 🚀

3. Implement a data retention policy: Set up a smart trash can that automatically disposes of old and irrelevant data. This keeps your storage lean and efficient without manual cleanups. 🗑📉

4. Use Delta Sharing: Share your data like a secure dropbox, making sure only authorized users can access it. This keeps your data both accessible and safe. 🔒📤

5. Leverage Unity Catalog: Think of Unity Catalog as your data’s central command center, governing access and ensuring compliance with ease. 🗃🛡
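
And as a sketch of points 1-3, assuming a txn_id merge key and a 30-day retention window (both assumptions, not details from this thread):

```python
# Hedged sketch of a Delta Lake MERGE upsert plus basic housekeeping.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

target = DeltaTable.forName(spark, "silver.fact_transactions")
updates = spark.table("bronze.oltp1_transactions")     # tonight's CDC batch

# 1) Upsert: update matched rows and insert new ones in a single atomic operation.
(
    target.alias("t")
    .merge(updates.alias("s"), "t.txn_id = s.txn_id")   # txn_id is a hypothetical key
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# 2) Time travel: read the table as it looked at an earlier version.
previous = spark.read.option("versionAsOf", 0).table("silver.fact_transactions")

# 3) Retention: drop data files no longer referenced, keeping 30 days of history.
spark.sql("VACUUM silver.fact_transactions RETAIN 720 HOURS")
```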

By following these practices, you can significantly improve the performance and maintainability of your data warehouse on Databricks Delta Lake compared to your current Oracle-based solution. The key is leveraging Spark's distributed processing power, Delta Lake's optimizations, and Databricks’ enterprise features.

arijit_sani
New Contributor II

Thanks a lot for the insight, much appreciated.

I am keen on understanding the first phase, that is, reading the data from the OLTP systems into Databricks on a nightly schedule.

Could you provide some additional reading/blog links, or a bit more technical detail, on replacing the existing CDC with Databricks connectors to Oracle (or any other RDBMS) and on how to schedule the sync?

Thanks in advance

arijit_sani
New Contributor II

Hi, could you please advise on this?
