02-14-2025 11:28 AM
Hi all,
- migrating from an on-premises Oracle database -
Currently in Oracle, I have a "library" of roughly 300 tables to load sequentially, based on views (some tables are potentially fed by several views, so the number of underlying views is higher).
All of this is processed by Oracle via PL/SQL, dynamic SQL, etc., and it all works very well.
The data volume per table is not big: in about 60% of the cases, a table has fewer than, say, 10,000 rows. The queries are complex, yes, but the volumes are relatively low.
It runs approximately every 2 hours, and since we log everything (start and end of each process, errors, etc.), the number of "actions" (selects, updates, inserts, ...) on the database grows very quickly in a short amount of time.
Again, this is the current situation in Oracle and all is fine.
Now, in Databricks, is it possible to reproduce the same approach via PySpark (with things like spark.sql(...), etc.), and should I be aware of potential issues on the performance side?
Assuming we have a clean environment, with the opportunity to size the compute correctly, can I assume that I don't risk much in terms of performance?
What would be the recommendations for this?
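To give an idea, one load step in my head would become something like the following in a notebook (all object names below are made up, just for illustration; spark is the session already available in Databricks notebooks):

# Log the start, run the prepared statement, log the end (placeholder tables)
spark.sql("INSERT INTO log.process_log SELECT 'CUSTOMER_DIM', 'START', current_timestamp()")
spark.sql("INSERT INTO dwh.customer_dim SELECT * FROM dwh.v_customer_dim")
spark.sql("INSERT INTO log.process_log SELECT 'CUSTOMER_DIM', 'END', current_timestamp()")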
I've already performed a small POC, reproducing what I do in Oracle, but on a very small scale.
Everything is fine and working as expected.
But I can't easily reproduce a bigger workload.
Any opinion or advice on this?
Thanks!
Vincent
a month ago
Hi all,
I asked Gemini, which said:
"In general, PySpark is designed to handle large volumes of data and complex processing efficiently. However, sequentially running a large number of requests can lead to performance degradation if the cluster resources are not sized properly or if the requests are not optimized."
Assuming our queries are fully optimized, the only remaining point would be cluster resources.
But I would prefer to have a human opinion... 🙂
a month ago
It should be no problem, but the pitfall you want to avoid is treating Databricks like a traditional database and trying to do everything in Spark SQL. Yes, Spark SQL will work, but Python or Databricks SQL might be better choices since they are higher level and add some performance enhancements.
Also, in Databricks you need to think in terms of a processing pipeline instead of a single database, and plan for a medallion architecture. It sounds like you're starting with a bunch of tables that are loaded through streaming (? - you're not clear on this), and you need to process the incoming data every couple of hours. The tables where the data lands are your bronze layer. You mentioned your queries are complex, which is a good use case for Python, but Databricks SQL can also work. You probably calculate some intermediate values before arriving at your final results. The intermediates are a silver layer, and the final results are probably gold.
Can you do everything in one schema using PySpark? Sure. But then you're not using the system to its fullest capabilities. If you haven't already, take the free Lakehouse Fundamentals course in the Databricks Academy.
a month ago
Hello Rjdudley,
Thanks for your answer! It is really appreciated.
I think I understand your point, and maybe I haven't described the situation correctly.
I said I have a bunch of roughly 300 tables, yes, but it is true that there is a "structure" behind them:
Let's say that - currently in Oracle - we have 3 layers (simplifying): Data Warehouse tables, DataMart tables, and reporting tables.
Each layer is based on the data of the previous layer.
Currently in Oracle (to describe it simply), the PL/SQL is just there to orchestrate the loading of the tables. We pass it only statements like "INSERT INTO mytable SELECT * FROM myview", statements which are already prepared in a kind of dictionary table.
The PL/SQL just needs to parse this dictionary table, check whether a table should be refreshed or not, and if so, execute the statement.
This is the kind of logic I was thinking of reproducing within Databricks, via PySpark.
In this case, is this logic "acceptable"?
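To illustrate, a very simplified sketch of what I imagine in PySpark (the dictionary table, its columns, and all other names are just placeholders, not our real objects):

# Minimal sketch of a dictionary-driven loader (all names are placeholders).
# Assumes a small control table ctrl.load_dictionary with columns
# target_table, load_statement, load_order and refresh_flag.
# In a Databricks notebook, spark is already available.
entries = (
    spark.table("ctrl.load_dictionary")
         .where("refresh_flag = 'Y'")
         .orderBy("load_order")
         .collect()   # the control table is tiny, so collect() is fine
)

for entry in entries:
    print(f"Refreshing {entry['target_table']}")
    # e.g. "INSERT INTO mytable SELECT * FROM myview"
    spark.sql(entry["load_statement"])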
My understanding is that the layer we currently call "Data Warehouse" is the bronze layer in Databricks, the "DataMart" layer is silver, and the "reporting tables" are the gold layer.
Is that correct?
Thanks!
Note: yes, I've registered for some Databricks Academy courses; I'm trying to fit them into my calendar...
a month ago
For orchestration, you can use Databricks Workflows or ADF.
The organization of the lakehouse should be:
Staging -> bronze
"Data Warehouse" / data marts -> gold
You can read more here about migration and how to replicate a data warehouse with the medallion architecture:
https://medium.com/@mariusz_kujawski/migration-from-oracle-data-warehouse-to-databricks-lakehouse-6e...
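For example, a rough sketch of defining such a sequential refresh as a Workflows job with the Databricks Python SDK (job name and notebook paths are only examples; you can build the same thing in the Workflows UI, with asset bundles, or in ADF):

# Sketch only: two dependent tasks, bronze then gold.
# Requires the databricks-sdk package and workspace authentication;
# without a cluster spec the tasks run on serverless jobs compute where available.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()

w.jobs.create(
    name="example_dwh_refresh",
    tasks=[
        jobs.Task(
            task_key="load_bronze",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/etl/load_bronze"),
        ),
        jobs.Task(
            task_key="load_gold",
            depends_on=[jobs.TaskDependency(task_key="load_bronze")],
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/etl/load_gold"),
        ),
    ],
)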
4 weeks ago
In a Lakehouse, the bronze layer is the raw data, while the gold layer is the most processed and curated, analogous to a data warehouse. From your reply, it sounds like you have an analytical data warehouse at the corporate level, and you need to copy data from the warehouse to reporting data marts, maybe at the department or geographic region level? That's a common BI pattern in very large companies.
If this is the case, and you're taking analytical data from a warehouse into reporting marts, then Databricks will work, but it's kind of overkill if this is the only use case for Databricks. Databricks itself is an analytics platform. If the only use for Databricks is migrating data, you might be better served by a more traditional ETL tool.
However, if your company is going to replace Oracle entirely and do all your analytics in Databricks, then copying data into data marts is possible, but you have better options. You can share data directly from Databricks with reporting tools like Power BI or Tableau without copying the data.
3 weeks ago
Hello Rjdudley,
To answer your first part: no, we don't need to copy or migrate data.
At some point we are supposed to leave Oracle and do everything in Databricks. At least, that is what we have been asked to evaluate, based on the current needs and everything that has been implemented in Oracle over such a long time.
Your description of bronze, silver, and gold is exactly the logic we already have in the structure we organized in Oracle, even if of course we don't use that terminology.
At least on the functional side, what is described with bronze, silver, and gold is the same as what we already do in Oracle and what we plan to do in Databricks.
You said: "You can share data directly from Databricks with reporting tools like Power BI or Tableau without copying the data." I don't really understand this, sorry.
"The gold layer is the most processed and curated": the equivalent of this layer in Oracle is based on complex queries, and the results of these queries are materialized in tables. Do you mean that in Databricks I should not materialize this data? Sorry, I'm getting lost...
3 weeks ago
" gold layer is the most processed and curated": the equivalent of this layer in Oracle is based on complex queries, and the results of this queries are materialized in tables. Do you mean that in Databricks I should not materialize this data? Sorry, I'm getting lost..
You absolutely should materialize this data. What I am saying is that after it is calculated, you can keep it in your Databricks gold layer; you don't need to copy it anywhere else. You can use Delta Sharing to connect Power BI or Tableau directly to Databricks, which saves you some work and gives you better data governance.
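As a rough sketch, exposing a gold table through Delta Sharing looks something like this (share, recipient, and table names are placeholders, and it assumes Unity Catalog with Delta Sharing enabled; depending on your setup you might instead point the Power BI connector at a SQL warehouse):

# Sketch only: share one gold table with an external recipient (placeholder names)
spark.sql("CREATE SHARE IF NOT EXISTS reporting_share")
spark.sql("ALTER SHARE reporting_share ADD TABLE main.gold.sales_summary")
spark.sql("CREATE RECIPIENT IF NOT EXISTS bi_team")
spark.sql("GRANT SELECT ON SHARE reporting_share TO RECIPIENT bi_team")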
2 weeks ago
Hello Rjdudley,
Thanks for the answer. Yes, that is clear: once everything is ready in the gold layer, our plan is to have the reporting tools access it directly and "nothing more", so to speak.
Now, I feel that I'm not really going to get an answer on the original topic. Nothing against you, it's not meant that way, but I feel this topic is not popular or well understood enough.
So I'll close it here.
Thanks!
a month ago
The common scenario for data processing in Oracle is based on PL/SQL and cursors. In PySpark there is no such concept as a cursor, and iterating over DataFrames row by row can lead to poor performance.
I have migrated Oracle to Databricks, and I learned that in many cases we can translate PL/SQL to PySpark, but sometimes it needs to be translated to Databricks Workflows, or the existing code has to be adjusted to avoid loops and typical cursor constructions.
If you can share some code, it will be easier to advise.
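For example, a cursor-style loop like the first block below is usually better rewritten as a single set-based statement (tables and columns are invented for the example, and gold.order_totals is assumed to be a Delta table):

from pyspark.sql import functions as F
# spark is already defined in a Databricks notebook

# Cursor-style thinking (avoid): row-by-row updates driven from the driver
for row in spark.table("silver.orders").collect():
    spark.sql(
        f"UPDATE gold.order_totals SET total = total + {row['amount']} "
        f"WHERE customer_id = '{row['customer_id']}'"
    )

# Set-based equivalent: one aggregation plus one MERGE handled by the engine
(spark.table("silver.orders")
      .groupBy("customer_id")
      .agg(F.sum("amount").alias("total"))
      .createOrReplaceTempView("order_totals_src"))

spark.sql("""
    MERGE INTO gold.order_totals t
    USING order_totals_src s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET t.total = t.total + s.total
    WHEN NOT MATCHED THEN INSERT (customer_id, total) VALUES (s.customer_id, s.total)
""")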
a month ago
Hello MariuszK,
Thanks for your answer!
To be clear, as I tried to describe to Rjdudley, I'm not planning to do any transformations via PySpark, or to parse any individual dataset/DataFrame coming out of each SQL statement.
This is really not what is done in PL/SQL either.