03-26-2022 06:58 PM
Delta Lake has three levels for maintaining data quality (bronze, silver, and gold tables), but this supports reporting and
BI solutions. How does it support streaming analytics?
Example: I have an app that loads all the operational data into ADLS Gen2, and I want to use this data to support real-time reporting needs. I also need to build a report over all the historical data stored in ADLS Gen2.
So do I need to store the data in two different ADLS Gen2 containers? One container would hold all the historical data and the other would hold the current day's data to support real-time reporting. And for this requirement I want to use the same Spark cluster to support both the real-time reporting needs and descriptive analytics over all the historical data.
03-28-2022 01:17 AM
Delta Lake does not have levels. Bronze/silver/gold is a concept for structuring the data lake.
You do not need a separate ADLS container; I would even advise against it. Why? Well, you have to ingest the data anyway, so it's best to do this only once (into one container). Within a single container you can structure the data any way you want. It does not have to be bronze/silver/gold; it can be anything that works for you/your company. I do suggest planning some kind of structure, as it will help you in the long run.
For the reporting part: if you want to run reports, Databricks SQL is better than the classic Spark clusters (which excel at engineering/ML). Databricks SQL supports concurrency, of course.
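For instance, a single-container layout following the medallion convention could look like the sketch below (the storage-account and folder names are purely illustrative, not a Databricks requirement):

```
abfss://datalake@mystorageaccount.dfs.core.windows.net/
├── bronze/   raw ingested data (streaming and batch both land here)
├── silver/   cleaned and conformed tables
└── gold/     aggregated, report-ready tables
```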
03-28-2022 01:23 AM
@Werner Stinckens Then do I need to use a Databricks SQL endpoint to connect to BI tools?
If yes, should I create a separate Spark cluster for it? That would mean I need two Spark clusters: one to run ETL/ELT jobs and another to query the Delta tables.
03-28-2022 01:42 AM
You certainly can connect BI tools to the 'classic' clusters. That works fine, but Databricks SQL is optimized for fast read queries (analytics tools), so you will get better performance.
03-28-2022 07:00 AM
@Werner Stinckens So in this context, how does Databricks eliminate the need for a lambda architecture? We still need two Spark clusters: one to process ETL/ELT jobs and another to connect the BI tools (SQL endpoint).
03-28-2022 07:06 AM
Not exactly. Lambda is basically streaming + batch running simultaneously.
You can achieve that with Databricks:
ingest real-time data into your blob storage and run reports on that data;
also transform the data in batch into, for example, a gold layer and report on that too.
But you don't need separate storage for the real-time (or near-real-time) and batch paths.
Having separate clusters for SQL workloads and ETL/ELT is a good idea, although technically not necessary. But I would do it.
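A minimal PySpark sketch of that idea: the streaming leg and the batch leg both point at the same Delta table, so there is no duplicated storage. This assumes a Databricks runtime with Delta Lake and a Kafka source; all paths, broker addresses, and topic names below are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical bronze path inside the single ADLS Gen2 container.
BRONZE = "abfss://lake@myaccount.dfs.core.windows.net/bronze/events"

# Streaming leg: continuously append raw events from Kafka into ONE bronze Delta table.
(spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "events")                     # hypothetical topic
    .load()
    .writeStream
    .format("delta")
    .option("checkpointLocation", BRONZE + "/_checkpoint")
    .start(BRONZE))

# Batch leg: a scheduled job reads the SAME table and builds a gold aggregate for BI.
(spark.read.format("delta").load(BRONZE)
    .groupBy("key")
    .count()
    .write.format("delta").mode("overwrite")
    .save("abfss://lake@myaccount.dfs.core.windows.net/gold/event_counts"))
```

Real-time dashboards can query the bronze (or silver) table directly, while historical reports run on the gold table; both derive from a single ingested copy of the data.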
03-28-2022 09:22 AM
@Werner Stinckens I think I need to connect with the support team on this … I need to connect the Databricks SQL endpoint to MicroStrategy.
Databricks is our one hub going forward, and I want to understand the elimination of the lambda architecture.
Even though the storage is the same ADLS Gen2: to do real-time reporting, Kafka always keeps only one day's worth of data in the Kafka cluster (a week at most) to support real-time analytics, so we are thinking of storing one day of data in one container to support real time, and years' worth of data in a separate container to support batch and historical reporting.
I need to understand this … let us know how to reach the support team.
03-28-2022 11:41 PM
https://docs.microsoft.com/en-us/azure/databricks/integrations/bi/microstrategy
Your proposed solution of two containers (one for real time and one for long term) is certainly possible, but not required.
You could, for example, land ALL data into a bronze layer (streaming or not) and move on from that layer.
The main point of attention is that you should try to minimize the query load on the source system (so avoid ingesting the same data twice).
03-29-2022 02:50 AM
@Werner Stinckens If I keep everything in one container, how does that support both batch and real-time analytics?
Example: I have a container with a Delta table holding 2 years of data that supports BI reporting on historical data.
If I use that same Delta table, how does it support real time? It would unnecessarily scan all 2 years of data, which is not needed and would definitely hamper performance.
To support real time, I only need 1 or 2 days' worth of data, right?
Do you see what I am getting at here?
03-31-2022 02:54 AM
I get it. I think you have a different understanding of 'container' than I do.
Basically you are talking about two 'tables': a real-time one and a batch one.
While that is certainly an option, you could also pull it off with one single table by partitioning on date. For the real-time queries you can then apply a filter on the partitioning column to trigger partition pruning (only the relevant partitions are scanned).
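A toy, pure-Python illustration of why partition pruning makes the single-table approach cheap: it only simulates the one-folder-per-date layout of a date-partitioned Delta table, since the real pruning is done by the Delta/Spark engine, not by code like this.

```python
from datetime import date, timedelta

# Simulate a Delta table partitioned by event_date: one "folder" per day,
# covering two years of history (2022-01-01 .. 2023-12-31).
partitions = [
    f"events/event_date={date(2022, 1, 1) + timedelta(days=i)}"
    for i in range(730)
]

def prune(partitions, min_date):
    """Keep only the partitions a query with `event_date >= min_date` must scan."""
    kept = []
    for p in partitions:
        d = date.fromisoformat(p.split("event_date=")[1])
        if d >= min_date:
            kept.append(p)
    return kept

# A "real-time" dashboard query filtered to the last 2 days touches
# 2 partitions instead of all 730 -- no second table or container needed.
recent = prune(partitions, date(2023, 12, 30))
print(len(recent), "of", len(partitions), "partitions scanned")
# → prints: 2 of 730 partitions scanned
```

On the actual Delta table, a filter such as `WHERE event_date >= current_date() - 1` has the same effect: Spark reads only the files of the matching partitions, so real-time queries never pay for the two years of history.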
03-31-2022 03:44 AM
@Werner Stinckens Why would I do that? I would rather store the tables in two containers: one supporting real time, holding only one day's worth of data, and another container holding all the historical data.
Can we not make BI work that way, supporting both real-time analytics and the historical reporting needs?
Please consider the case of connecting a traditional BI tool to a modern data platform, i.e. Databricks.
03-31-2022 03:45 AM
@Basavaraj Angadi Why? For simplicity, cost savings, etc.
You can make it work with two 'containers', but it is not necessary.