4 weeks ago
Hello all,
I'm quite new to the Databricks world, and I'm currently analyzing a migration from on-premise Oracle (with a lot of SQL, PL/SQL, custom code, etc.) to Databricks.
Let me try to illustrate my current situation in Oracle (in summary):
. We currently have 3 main "areas":
- a data warehouse layer
- a datamart layer
- a last layer with reporting tables, etc., exposed to some reporting tools.
The data in the data warehouse layer is stored in dedicated tables, refreshed every 2 hours, each table being fed by several views.
The data in the datamart layer is also stored in dedicated tables, with the same refresh rate, each table being fed by a view that refers to one or several tables from the data warehouse layer.
The data in the reporting layer is stored in dedicated tables as well, with the same refresh rate, each table being fed by a view referring to several tables from the datamart layer.
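To make the pattern concrete, one refresh today is essentially a statement of the following kind (shown as a full reload for simplicity; the table and view names are only placeholders):

-- Oracle side: one data warehouse table reloaded from its feeding view
TRUNCATE TABLE dwh_sales;

INSERT /*+ APPEND */ INTO dwh_sales
SELECT * FROM v_dwh_sales;  -- view combining the source tables

COMMIT;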
Question:
- In the Databricks world, what is the best practice to reproduce the same "flow"?
Should I use Notebooks attached to a pipeline, Notebooks in which I would write the relevant code to perform updates / inserts / deletes / truncates (or even trigger the refresh of a materialized view)?
Or is there a preferable way?
I see that I could use Python, Scala or SQL in Notebooks in the same way.
For this kind of task, can I simply use whichever language I'm most comfortable with? Or would it be better to choose Python for the functionality it provides (libraries, etc.)?
Thanks!
Vincent
4 weeks ago
Hello, your approach seems correct, and yes, you can work with whichever language you feel most comfortable with. I would also suggest going through this blog, which explains best practices for migrating from Oracle to Databricks: https://www.databricks.com/blog/how-migrate-your-oracle-plsql-code-databricks-lakehouse-platform
3 weeks ago
Hello,
Thanks for your answer.
To summarize my situation:
. I have tables created via "classic" SQL statements, which then appear as "Delta" tables in Databricks.
. Then, in order to load these tables (based on existing views in Databricks), the normal process is to write the relevant code in Notebooks and use this to create a "flow".
This is a normal process in Databricks, correct?
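To be concrete, this is the kind of "classic" statement I mean for the first step (table and column names are just placeholders):

-- Databricks SQL: a plain CREATE TABLE, stored as a Delta table by default
CREATE TABLE IF NOT EXISTS datamart.sales_summary (
  sale_date    DATE,
  region       STRING,
  total_amount DECIMAL(18,2)
);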
Do you know if people using Databricks would use a different logic to perform the same tasks (for the same purpose)? For example, based on an additional tool that would connect to Databricks and automate the loading of the tables?
Thanks!
Vincent
3 weeks ago
Yes, the process you described is a normal and common practice in Databricks. Creating tables via "classic" SQL statements and then using Databricks Notebooks to write the relevant code for loading and transforming these tables is a standard approach. These Notebooks can be scheduled to run at specific intervals using Databricks Jobs, which helps in creating a "flow" for your data pipelines.
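As a rough illustration (the table and view names below are placeholders, not objects from your environment), one cell of such a Notebook could contain standard Databricks SQL like the following, and a Databricks Job could then run the Notebook every 2 hours:

-- Full refresh of a datamart table from its feeding view
INSERT OVERWRITE datamart.sales_summary
SELECT * FROM datamart.v_sales_summary;

-- Or, for an incremental load, an upsert into the Delta table
-- (assuming the view exposes the same columns as the target table)
MERGE INTO datamart.sales_summary AS tgt
USING datamart.v_sales_summary AS src
  ON tgt.sale_date = src.sale_date AND tgt.region = src.region
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;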
Regarding alternative tools, while Databricks provides robust native solutions like Delta Live Tables (DLT) and Databricks Jobs for automating and orchestrating data workflows, there are also other tools that can connect to Databricks and automate the loading of tables. For example, Azure Data Factory (ADF) can be used to orchestrate Databricks jobs as part of an ADF pipeline. ADF supports running Databricks notebooks, Python scripts, or code packaged in JARs within its pipelines, providing an additional layer of automation and integration capabilities.
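For completeness, here is a minimal sketch of how one such table could instead be declared in a DLT pipeline using SQL (again, the names are placeholders); with DLT you declare the end state and the pipeline handles the refresh on its schedule:

-- Delta Live Tables (SQL): declarative definition of a datamart table
CREATE OR REFRESH LIVE TABLE sales_summary
AS SELECT sale_date,
          region,
          SUM(amount) AS total_amount
   FROM   live.dwh_sales  -- upstream table defined in the same pipeline
   GROUP BY sale_date, region;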
3 weeks ago
Hello,
Thanks for your clear answers!
It will surely be very helpful for me in the near future.
On my side, I'm looking forward to working more with Databricks ... 🙂
Thanks
Vincent
3 weeks ago
Hello, sure, if any additional information is needed, just let us know and we will be happy to assist.