When to use Delta Lake versus a relational database as a source for BI reporting?

prasadvaze
Valued Contributor

Assume all of your data exists both in Delta tables and in SQL Server, so you have a choice to report from either. Can someone share thoughts on the question: "In what scenario would you not want a report created from a Delta table, and instead use the traditional relational database source, and why?" One can now create reports on Delta tables using BI tools (Qlik, SSRS, Tableau, etc.).

For example: an interactive query/report performs better on a relational DB than on Delta Lake (which may not necessarily be true), or a business user does not understand the steps involved in setting up a Spark cluster and connecting a BI tool to a Delta table, so it is a hassle (or a training issue?). I am interested in hearing the technical and non-technical reasons people have for not pointing a BI tool at Delta Lake for reporting.

11 REPLIES

Ryan_Chynoweth
Honored Contributor III

While this may not directly answer your question, I think it is important to consider the total cost of ownership for storing and maintaining two sets of data.

As an example, let's say we ingest data into cloud storage and process it through a Delta Lake using Databricks. At that point the data is "report ready", yet we then write it to SQL Server as well. You now pay additional compute costs to write the data, additional storage costs for SQL Server, and the total pipeline runs longer because of the extra, unnecessary step. So in most cases it makes sense to read data directly from the Delta Lake using BI tools.

In most cases we integrate easily with BI tools. For example, users can connect Tableau to a Databricks SQL Endpoint with the built-in connector and consume data directly from Databricks; the whole connection can be set up in the BI tool's UI. SQL Endpoints are also easy to manage: users simply select a size (small, medium, large, etc.) and click create.
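
For anyone who wants to test such a connection outside a BI tool, the same SQL Endpoint is reachable from the open-source databricks-sql-connector Python package. A minimal sketch; the hostname, HTTP path, access token, and the sales table are placeholders to swap for your endpoint's connection details:

```python
# A minimal sketch using the open-source databricks-sql-connector
# package (pip install databricks-sql-connector). The hostname, HTTP
# path, token, and the `sales` table are placeholders: copy the real
# values from your SQL Endpoint's connection details.
from databricks import sql

connection = sql.connect(
    server_hostname="adb-1234567890.1.azuredatabricks.net",  # placeholder
    http_path="/sql/1.0/endpoints/1234567890abcdef",         # placeholder
    access_token="dapi...",                                  # placeholder
)
cursor = connection.cursor()
# A Delta table is queried like any other SQL source
cursor.execute("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
for row in cursor.fetchall():
    print(row)
cursor.close()
connection.close()
```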

With Databricks SQL it is much less common to need a separate relational database, since Databricks satisfies the necessary requirements.

I would say that if you are sourcing data for a web application, it may make more sense to keep the data in a relational database. But you specifically mentioned BI tools.

prasadvaze
Valued Contributor

@Ryan Chynoweth why do you not recommend using Delta Lake as a backend for a web application? It's meant for OLAP and not OLTP applications, right?

So if we are looking to solve for analytical reports only, would you eliminate the relational DB or MPP DB (like Snowflake, Synapse, Redshift) in favor of Delta Lake? Do you foresee any challenges with this strategy, such as data security, fine-grained access controls, the lack of foreign key constraints, anything?

Does Delta table query performance degrade as the number of versions grows? Every transaction on the table creates a new version of it (which is kept until vacuuming is done routinely).

Ryan_Chynoweth
Honored Contributor III

Yes, Delta is primarily used for OLAP-type applications. Web apps typically need a high-concurrency transactional (OLTP) database, which is why you would likely want a transactional database there.

If you are looking for an analytics system, then use Databricks + Delta Lake: a single platform for all your BI and ML needs. With traditional data warehouses (Snowflake, Synapse, Redshift) you only cover BI-type workloads, whereas Delta gives you the DW features on top of your data lake. I would suggest going with a Lakehouse rather than a data warehouse, because it reduces your total cost of ownership, enables your team, and has better support for more advanced data use cases.

Delta performance does not degrade as versions grow, because the transaction log maintains individual transactions as well as periodic snapshots (checkpoints) of the table state. Check out this blog that describes in detail how Delta works.
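
To make that concrete, you can inspect the version history and prune old files yourself. A small sketch, assuming a Databricks notebook where spark is predefined; the table name sales is a placeholder:

```python
# A sketch, assuming a Databricks notebook where `spark` is predefined;
# the table name `sales` is a placeholder.
from delta.tables import DeltaTable

# One row per version: who changed what, when, and by which operation
spark.sql("DESCRIBE HISTORY sales").show(truncate=False)

# Delete data files no longer referenced by any version newer than the
# retention window (168 hours = the default 7 days)
DeltaTable.forName(spark, "sales").vacuum(168)
```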

Are you looking for a new analytics stack for your team?

-werners-
Esteemed Contributor III

For what it's worth:

we do all our data processing in Databricks and finally copy our curated data to a DWH (for historical reasons), where most of our BI runs.

This is, in my opinion, an anti-pattern: reporting directly on our data lake (Delta Lake + Parquet) eliminates the data copy. You gain time (no more copy step), less maintenance, and a less complex architecture.

Of course you will have to assess whether your BI tool is able to consume Delta Lake/Parquet, or use the SQL Endpoints of Databricks (or some other SQL engine).

I want to get rid of our data warehouse as soon as I can.

Ryan_Chynoweth
Honored Contributor III

Which Data Warehouse are you all copying your data to? Which BI tool are you using?

I would agree that the DW seems redundant. We support direct access from all of the most popular BI tools, and users typically use SQL Endpoints to connect to their Delta Lake.

The total cost of ownership can decrease dramatically by eliminating the DW, since it is an extra processing step that merely replicates data. The SLAs usually improve too.

You're not "eliminating" the data warehouse. You're just using a different technology. Remember that EDW is a set of architectures and not a technology. Big difference.

@Werner Stinckens are you able to implement referential integrity on Delta tables?

-werners-
Esteemed Contributor III

Not by means of constraints. If you want RI, you have to handle it in code.

At the moment Delta Lake only has NOT NULL and CHECK constraints.
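
A sketch of what that looks like in practice, assuming a Databricks notebook where spark is predefined; the orders/customers tables and the customer_id column are hypothetical:

```python
# A sketch, assuming a Databricks notebook where `spark` is predefined;
# the orders/customers tables and the customer_id column are hypothetical.

# The two constraint types Delta supports today, declared via SQL:
spark.sql("ALTER TABLE orders ALTER COLUMN customer_id SET NOT NULL")
spark.sql("ALTER TABLE orders ADD CONSTRAINT positive_qty CHECK (quantity > 0)")

# Referential integrity itself has to be validated in code, e.g. with a
# left-anti join that surfaces orders pointing at nonexistent customers:
orphans = spark.table("orders").join(
    spark.table("customers"), on="customer_id", how="left_anti"
)
orphan_count = orphans.count()
if orphan_count > 0:
    raise ValueError(f"{orphan_count} orders violate the customer foreign key")
```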

prasadvaze
Valued Contributor

In our case, we process data in ADLS using Databricks and copy it into Synapse. Users develop with Qlik Sense, Power BI, and SSRS, and they are far more comfortable with T-SQL than with Spark SQL. I think over time they will come to accept having no T-SQL and no SQL Server, just Delta Lake.

We also opened up Databricks SQL to users, and the first issue we found is that it has no event-based trigger to run a report; we can only schedule by time.
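
One possible workaround (a sketch, not something verified in this thread) is to wrap the report's query in a Databricks Job and fire it from your event source through the Jobs 2.1 REST API; the host, token, and job ID below are placeholders:

```python
# A sketch: the report's query lives in a Databricks Job, and whatever
# detects the event (an Azure Function, a webhook, etc.) starts a run
# through the Jobs 2.1 REST API. Host, token, and job ID are placeholders.
import requests

DATABRICKS_HOST = "https://adb-1234567890.1.azuredatabricks.net"  # placeholder
TOKEN = "dapi..."                                                  # placeholder
JOB_ID = 123                                                       # placeholder

def trigger_report(job_id: int) -> str:
    """Start the job now and return the run_id of the new run."""
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"job_id": job_id},
        timeout=30,
    )
    resp.raise_for_status()
    return str(resp.json()["run_id"])

# e.g. call trigger_report(JOB_ID) from the event handler
```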

Also, Delta Lake does not enforce the referential integrity constraints that we have on certain SQL Server tables.

Kaniz
Community Manager

Hi @prasad vaze, just a friendly follow-up: do you still need help, or did the above responses help you find a solution? Please let us know.

PCJ
New Contributor II

Hi @Kaniz Fatma, I would like a follow-up on @prasad vaze's question regarding unsupported referential integrity. How does one work around that, using best practices as Databricks sees them?
