cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Hi Experts i am new to data bricks and i want to know how data bricks supports real time reporting needs in Business intelligence?

BasavarajAngadi
Contributor

Delta lake have 3 levels to maintain data quality ( bronze , silver and gold tables ) but this supports the reporting and 

BI solutions how does this supports the streaming analytics ?

example : I have an app that loads all the operational data in adls gen 2 and i want use this data which will help me to support real time  reporting needs and also i need to build a report with all the historical data stored in ADLS gen 2.

So For this do i need to store data into two different container of ADLS gen2 ? one container will have all history data and other will store   day of data to support real time reporting ?  and for above requirement i want make use of same spark cluster to support both real time reporting needs as well has descriptive analytics with all history data.

1 ACCEPTED SOLUTION

Accepted Solutions

-werners-
Esteemed Contributor III

You certainly can connect BI tools to the 'classic' clusters. That works fine, but databricks SQL is optimized for fast read queries (Analytics tools) . So you will get better performance.

View solution in original post

12 REPLIES 12

-werners-
Esteemed Contributor III

delta lake does not has levels. The bronze/silver/gold is a concept to structure the data lake.

You do not need a separate ADLS container, I would even advice against it. Why? Well, you have to ingest the data anyway, so it's best to do this only once (one container). Within a single container you can structure the data any way you want. It does not have to be bronze/silver/gold but can be anything that works for you/your company. I do suggest to foresee some kind of structure, it will help you in the long run.

For the reporting part: if you want to run reports, Databricks SQL is better than the classic spark clusters (which excel in engineering/ML). Databricks SQL supports concurrency ofc.

BasavarajAngadi
Contributor

@Werner Stinckensโ€‹  Then Do i need to use Databricks SQL end point to Connect to BI tools ?

If yes then i should create separate Spark Runtime Cluster for it ? So this means i need two spark clusters one to run ETL or ELT jobs and other one to Query Delta Tables ?

-werners-
Esteemed Contributor III

You certainly can connect BI tools to the 'classic' clusters. That works fine, but databricks SQL is optimized for fast read queries (Analytics tools) . So you will get better performance.

BasavarajAngadi
Contributor

@Werner Stinckensโ€‹  So in this context how data bricks eliminates the need of lamda architecture? we anyway need two spark clusters one to process etl or elt jobs and other to connect the BI tools ( SQL end point )

-werners-
Esteemed Contributor III

Not exactly. Lambda basically is streaming + batch simultaneously.

You can achieve that with databricks.

ingest real time data into your blob storage, and run reports on that data.

Also transform the data into f.e. golden layer in batch and also report on that.

But you don't need separate storage for the realtime (or near realtime) and batch.

Having separate clusters for SQL workloads and ETL/ELT is a good idea, although technically not necessary. But I would do it.

BasavarajAngadi
Contributor

@Werner Stinckensโ€‹  i think we i need to connect with support team on this โ€ฆ need to connect SQL end point of data bricks to MicroStrategy .

Databricks is our one hub going forward and I want to understand the elimination of lambda architecture.

though the storage is same ADLS gen 2 .....to do reporting for real time where Kafka stores always one day worth of data in the kafka cluster or a week of data max to support real time analytics so we are thinking of storing one day of data in separate container and to support real time and years worth of data in separate container to support batch and historical reporting.

Need to understand ....let us know how to connect support team

-werners-
Esteemed Contributor III

https://docs.microsoft.com/en-us/azure/databricks/integrations/bi/microstrategy

Your proposed solution to have 2 containers (1 for real time and one for long term) is certainly possible, but not required.

You could land ALL data into a bronze layer (streaming or not) for example. And move on from that layer.

The main attention point is that you should try to minimize the query load on the source system (so avoid ingesting the same data twice).

BasavarajAngadi
Contributor

@Werner Stinckensโ€‹  if i keep anything in one container how does that support batch and real time analytics ?

Example : i have a container with delta table which has 2 years of data that supports BI reporting for historical data.

if i use same delta table how does that supports real time it unnecessary scans all 2 years data which is not needed and would definitely hamper performance.

to support real time i need have 1 or 2 days worth of data right ?

are you getting what i am addressing here ?

-werners-
Esteemed Contributor III

I get it. I think you have another understanding of 'container' than me.

Basically you are talking about 2 'tables', a realtime one and a batch one.

Whereas that is certainly an option, you could also pull it off in one single table using partitioning on date. For the realtime queries you can then apply a filter on the partitioning column to apply partition pruning (only relevant partitions are selected).

BasavarajAngadi
Contributor

@Werner Stinckensโ€‹  why i would do that ? rather i will prefer storing the tables in two containers one supports for real time which holds only one day worth of data and one more container which will hold all historical data.

This way can we not make BI work? that supports real time analytics and historical needs of data for reporting?

Please consider of connecting traditional BI tool to modern data platform i.e Databricks .

-werners-
Esteemed Contributor III

@Basavaraj Angadiโ€‹ , Why? For simplicity, cost savings etc.

You can make it work with 2 'containers' but it is not necessary.

Kaniz
Community Manager
Community Manager

@Basavaraj Angadiโ€‹ , Just a friendly follow-up. Do you still need help, or @Werner Stinckensโ€‹ 's response help you to find the solution? Please let us know.

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.