Warehousing & Analytics
Engage in discussions on data warehousing, analytics, and BI solutions within the Databricks Community. Share insights, tips, and best practices for leveraging data for informed decision-making.

Best practice to materialize data

VCA50380
New Contributor III

Hello all,
I'm quite new to the Databricks world and am currently analyzing a migration from on-premise Oracle (with a lot of SQL, PL/SQL, custom things, etc.) to Databricks.

Let me try to illustrate my current situation in Oracle (summary):
. We currently have 3 main "areas":
 - a data warehouse layer
 - a datamart layer
 - a last layer with reporting tables, etc., exposed to some reporting tools.
The data in the data warehouse layer is stored in dedicated tables, refreshed every 2 hours, each table being fed by several views.
The data in the datamart layer is also stored in dedicated tables, with the same refresh rate as above, each table being fed by a view that refers to one or several tables from the data warehouse layer.
The data in the reporting layer is also stored in dedicated tables, with the same refresh rate as above, each table being fed by a view referring to several tables from the datamart layer.

Question:
- In the Databricks world, what is the best practice to reproduce the same "flow"?
Should I use notebooks attached to a pipeline, in which I would write the relevant code to perform updates / inserts / deletes / truncates (or even trigger the refresh of a materialized view)?
Or is there a preferable way?
I see that I could equally use Python, Scala, or SQL in notebooks.
For this kind of task, can I simply use whichever I'm most comfortable with, or would it be better to use the functionality provided by Python (due to some libraries, etc.)?
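To give an idea of what I mean, here is a rough sketch of one refresh step as I imagine it in a notebook (the schema, table, and view names are just placeholders, and the materialized-view part assumes materialized views are available on my compute):

# Placeholder names; rebuild a datamart table from the view that feeds it.
# (In a Databricks notebook, `spark` is already available.)
spark.sql("""
    CREATE OR REPLACE TABLE datamart.sales_summary AS
    SELECT * FROM datamart.v_sales_summary
""")

# Alternatively, if the target were a materialized view, trigger its refresh.
spark.sql("REFRESH MATERIALIZED VIEW datamart.mv_sales_summary")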

Thanks!
Vincent


5 REPLIES

Walter_C
Databricks Employee

Hello, your approach seems correct. Regarding the language, it is also fine to work with whichever one you feel most comfortable with. I would also suggest going through this blog, which explains best practices for migrating from Oracle to Databricks: https://www.databricks.com/blog/how-migrate-your-oracle-plsql-code-databricks-lakehouse-platform

VCA50380
New Contributor III

Hello,

Thanks for your answer.
To summarize my situation:
. I have tables created via "classic" SQL statements, which then appear as Delta tables in Databricks.
. Then, in order to load these tables (based on existing views in Databricks), the normal process is to write the relevant code in notebooks and use that to create a "flow".
This is a normal process in Databricks, correct?
Do you know whether people using Databricks would use a different approach to perform the same tasks (for the same purpose)? For example, an additional tool that would connect to Databricks and automate the loading of the tables?
Thanks!
Vincent

Walter_C
Databricks Employee

Yes, the process you described is a normal and common practice in Databricks. Creating tables via "classic" SQL statements and then using Databricks Notebooks to write the relevant code for loading and transforming these tables is a standard approach. These Notebooks can be scheduled to run at specific intervals using Databricks Jobs, which helps in creating a "flow" for your data pipelines.
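As a rough illustration (the schema, table, and view names below are placeholders, not your actual objects), a notebook cell that materializes tables from their feeding views could look like the sketch below, and the notebook can then be scheduled every 2 hours as a Databricks Job:

# Placeholder mapping of target Delta tables to the views that feed them.
tables = {
    "datamart.sales_summary": "datamart.v_sales_summary",
    "reporting.sales_report": "reporting.v_sales_report",
}

for target, source_view in tables.items():
    # Full refresh: rebuild each Delta table from its feeding view.
    spark.sql(f"CREATE OR REPLACE TABLE {target} AS SELECT * FROM {source_view}")

If a full rebuild is too heavy, the same loop can issue MERGE INTO statements instead for incremental loads.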

Regarding alternative tools, while Databricks provides robust native solutions like Delta Live Tables (DLT) and Databricks Jobs for automating and orchestrating data workflows, there are also other tools that can connect to Databricks and automate the loading of tables. For example, Azure Data Factory (ADF) can be used to orchestrate Databricks jobs as part of an ADF pipeline. ADF supports running Databricks notebooks, Python scripts, or code packaged in JARs within its pipelines, providing an additional layer of automation and integration capabilities.
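If you go the Delta Live Tables route, a minimal Python sketch (again with placeholder names) that declares a datamart table on top of an existing warehouse table could look like this; the code runs as part of a DLT pipeline, which can itself be scheduled (e.g., every 2 hours):

import dlt
from pyspark.sql import functions as F

# Placeholder names; DLT refreshes this table each time the pipeline runs.
@dlt.table(name="sales_summary", comment="Datamart table fed by the warehouse layer")
def sales_summary():
    return (
        spark.read.table("dw.orders")  # existing warehouse-layer table
        .groupBy("customer_id")
        .agg(F.sum("amount").alias("total_amount"))
    )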

VCA50380
New Contributor III

Hello,
Thanks for your clear answers!
They will certainly be very helpful to me in the near future.
On my side, I'm looking forward to working more with Databricks ... 🙂

Thanks
Vincent

Walter_C
Databricks Employee
Databricks Employee

Hello, sure. If any additional information is needed, just let us know and we will be happy to assist.
