Data Engineering

When to use Delta Lake versus a relational database as a source for BI reporting?

prasadvaze
Valued Contributor II

Assume all of your data exists both in delta tables and in SQL Server, so you can report from either. Can someone share thoughts on the question: in what scenario would you not want a report built on a delta table, and instead use the traditional relational database as the source, and why? Note that one can create reports on delta tables using BI tools (Qlik, SSRS, Tableau, etc.).

For example: an interactive query/report performs better on a relational DB than on Delta Lake (this may not necessarily be true); or a business user does not understand the steps involved in setting up a Spark cluster and connecting a BI tool to a delta table, so it's a hassle (or a training issue?). I am interested in hearing what technical and non-technical reasons people can think of for not pointing a BI tool at Delta Lake for reporting.

10 REPLIES

Ryan_Chynoweth
Esteemed Contributor

While this may not directly answer your question, I think it is important to consider the total cost of ownership for storing and maintaining two sets of data.

As an example, let's say we ingest data into cloud storage and process it through Delta Lake using Databricks. At this point the data is "report ready," and we then write it to SQL Server. You now have additional compute costs for writing the data, additional storage costs for SQL Server, and a longer-running pipeline because of the extra, unnecessary step. So in most cases it makes sense to read data directly from the delta lake using BI tools.

In most cases we integrate easily with BI tools. For example, users can use Tableau's built-in connector to connect to a Databricks SQL Endpoint and consume data directly from Databricks. This connection can be set up entirely in the BI tool's UI. SQL Endpoints are also easy to manage: users simply select a size (small, medium, large, etc.) and click create.
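For reference, that connection usually boils down to a single JDBC connection string under the hood. A minimal sketch follows; the exact parameter set depends on your driver version, and the workspace host, endpoint ID, and personal access token are placeholders you would substitute from your own workspace:

```
jdbc:spark://<workspace-host>:443/default;transportMode=http;ssl=1;httpPath=/sql/1.0/endpoints/<endpoint-id>;AuthMech=3;UID=token;PWD=<personal-access-token>
```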

With Databricks SQL it is much less common to need a separate relational database, since Databricks satisfies the requirements on its own.

I would say that if you are sourcing data for a web application, it may make more sense to keep the data in a relational database. But you specifically mentioned BI tools.

prasadvaze
Valued Contributor II

@Ryan Chynoweth, why do you not recommend using Delta Lake as a backend for a web application? Because it's meant for OLAP and not OLTP applications, right?

So if we are looking to solve for analytical reporting only, would you eliminate the relational DB or MPP DB (like Snowflake, Synapse, Redshift) in favor of Delta Lake? Any challenges you foresee with this strategy? Anything like data security, fine-grained access controls, the lack of foreign key constraints, anything else?

Does delta table query performance degrade as the number of versions grows? Every transaction on the table creates a new version (which is kept until the table is routinely vacuumed).

Yes, Delta is primarily used for OLAP-type applications. Web apps typically use a high-concurrency transactional database, i.e., OLTP. That is why you would likely want a transactional database for that case.

If you are looking for an analytics system, then use Databricks + Delta Lake. It is a single platform for all your BI and ML needs. Traditional data warehouses (Snowflake, Synapse, Redshift) only cover BI-type workloads, while Delta provides the DW features on top of your data lake. I would suggest going with a Lakehouse rather than a data warehouse: it will reduce your total cost of ownership, enable your team, and better support more advanced data use cases.

Delta performance does not degrade as versions grow, because Delta maintains the transaction log along with periodic snapshots (checkpoints) of the table state. Check out this blog that describes in detail how Delta works.
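As an illustration of the versioning machinery being discussed, here is a small sketch in Spark SQL; the table name is hypothetical:

```sql
-- List the table's versions recorded in the Delta transaction log
DESCRIBE HISTORY sales.orders;

-- Time travel: query the table as it was at an earlier version
SELECT COUNT(*) FROM sales.orders VERSION AS OF 42;

-- Physically remove data files unreferenced for 7 days (168 hours);
-- until vacuumed, older versions remain queryable
VACUUM sales.orders RETAIN 168 HOURS;
```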

Are you looking for a new analytics stack for your team?

-werners-
Esteemed Contributor III

For what it's worth:

We do all our data processing in Databricks and finally copy our curated data to a DWH (for historical reasons), which most of our BI runs on.

This is, in my opinion, an anti-pattern, as reporting directly on our data lake (Delta Lake + Parquet) eliminates the data copy. You gain time (no more copy), less maintenance, and a less complex architecture.

Of course, you will have to assess whether your BI tool can consume Delta Lake/Parquet, or use the SQL Endpoints of Databricks (or some other SQL engine).

I want to get rid of our data warehouse as soon as I can.

Which Data Warehouse are you all copying your data to? Which BI tool are you using?

I would agree that the DW seems redundant. We support direct access from all of the most popular BI tools, and users typically use SQL Endpoints to connect to their delta lake.

The total cost of ownership can decrease dramatically by eliminating the DW, as it is an extra processing step that merely replicates data. The SLAs usually get better too.

You're not "eliminating" the data warehouse; you're just using a different technology. Remember that an EDW is a set of architectures, not a technology. Big difference.

prasadvaze
Valued Contributor II

@Werner Stinckens, are you able to implement referential integrity on delta tables?

-werners-
Esteemed Contributor III

Not by means of constraints. If you want RI, you can handle it in code.

At the moment there are only NOT NULL and CHECK constraints on Delta Lake.
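To make that concrete, here is a minimal sketch (table and column names are hypothetical) of the two supported constraint types, plus an anti-join that implements an RI check "in code":

```sql
-- The two constraint types Delta Lake currently supports
ALTER TABLE silver.orders ALTER COLUMN customer_id SET NOT NULL;
ALTER TABLE silver.orders ADD CONSTRAINT positive_amount CHECK (amount >= 0);

-- Referential integrity handled in code: orphaned orders whose
-- customer_id has no match in the customers table (should return 0 rows)
SELECT o.*
FROM silver.orders o
LEFT ANTI JOIN silver.customers c
  ON o.customer_id = c.customer_id;
```

A check like the anti-join above would typically run as a data-quality step in the pipeline, since nothing stops a writer from inserting an orphaned row in the first place.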

prasadvaze
Valued Contributor II

In our case, we process data in ADLS using Databricks and copy it into Synapse. Users develop with Qlik Sense and Power BI/SSRS. They are much more comfortable with T-SQL than Spark SQL. I think that over time they will come to accept having no T-SQL and no SQL Server, just Delta Lake. A small illustration of that transition follows.
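For what the transition looks like in practice, here is a small, hypothetical comparison: most ANSI SQL carries over unchanged, and the friction tends to be in pagination and built-in functions:

```sql
-- T-SQL:    SELECT TOP 10 region, SUM(amount) AS total
--           FROM dbo.orders GROUP BY region ORDER BY total DESC;
-- Spark SQL equivalent (hypothetical table):
SELECT region, SUM(amount) AS total
FROM orders
GROUP BY region
ORDER BY total DESC
LIMIT 10;
```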

We also opened up Databricks SQL to users, and the first issue we found is that it has no event-based trigger to run a report; we can only schedule by time.

Also, Delta Lake does not enforce the referential integrity constraints that certain of our tables have in SQL Server.

PCJ
New Contributor II

Hi @Kaniz Fatma, I would like to follow up on @prasad vaze's question regarding unsupported referential integrity. How does one work around that, using best practices as Databricks sees them?
