Architecture choice, streaming data

baatchus
New Contributor III

I have sensor data coming into Azure Event Hubs and need some help deciding how best to ingest it into Azure Data Lake and Delta Lake:

Option 1:

azure event hub > databricks structured streaming > delta lake (bronze)

Option 2:

azure event hub > event hub capture to Azure Data Lake gen 2 > Databricks Autoloader > delta lake (bronze)

There is no need for real-time processing; the data only needs to be processed on demand. Please state the reasons for choosing either option.
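For concreteness, here is a rough PySpark sketch of what Option 1 could look like, assuming the azure-eventhubs-spark connector is installed on the cluster; the connection string and ADLS paths are placeholders, not real values:

```python
# Hypothetical placeholders: supply your own Event Hubs connection string and ADLS paths.
connection_string = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...;EntityPath=<hub>"

eh_conf = {
    # The connector expects the connection string encrypted with its helper
    # (sc is the SparkContext available in a Databricks notebook).
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string),
}

raw = (spark.readStream
       .format("eventhubs")
       .options(**eh_conf)
       .load())

# The payload arrives as binary; cast it to string for the bronze layer.
bronze = raw.withColumn("body", raw["body"].cast("string"))

(bronze.writeStream
 .format("delta")
 .option("checkpointLocation", "abfss://bronze@<account>.dfs.core.windows.net/_checkpoints/sensors")
 .start("abfss://bronze@<account>.dfs.core.windows.net/sensors"))
```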

ACCEPTED SOLUTION

Ryan_Chynoweth
Honored Contributor III

Hi @baatchus, this is a great question. Option 1 is ideal if you require real-time processing of your data. Since you noted that you only need to process data on demand, I think Option 2 is the better choice for you.

Option 1 would require 24/7 processing (i.e. a 24/7 cluster), which is more costly than you need. Since you can do batch processing, Option 2 would be more cost-effective. Event Hubs Capture lets you write directly into ADLS without an intermediate tool.

If you ever do require stream processing, it wouldn't be difficult to switch between your two options.
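For illustration, a minimal Auto Loader sketch of Option 2's ingest step, assuming Capture writes Avro files to an ADLS container; all paths here are hypothetical:

```python
# Hypothetical paths: Event Hubs Capture lands Avro files under a configurable
# prefix, typically <namespace>/<hub>/<partition>/<yyyy>/<mm>/<dd>/...
capture_path = "abfss://capture@<account>.dfs.core.windows.net/<namespace>/<hub>/"
bronze_path = "abfss://bronze@<account>.dfs.core.windows.net/sensors"

df = (spark.readStream
      .format("cloudFiles")                 # Auto Loader
      .option("cloudFiles.format", "avro")  # Capture files are Avro
      .option("cloudFiles.schemaLocation", bronze_path + "/_schema")
      .load(capture_path))

# Capture's Avro records carry the event payload in a binary `Body` field.
bronze = df.withColumn("Body", df["Body"].cast("string"))

(bronze.writeStream
 .format("delta")
 .option("checkpointLocation", bronze_path + "/_checkpoint")
 .trigger(once=True)   # run on demand, drain the new files, then stop
 .start(bronze_path))
```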


6 REPLIES

Kaniz
Community Manager

Hi @baatchus! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer first; otherwise I will get back to you soon. Thanks.


baatchus
New Contributor III

@Ryan_Chynoweth, thanks for the reply.

Option 1 can be configured with a trigger-once schedule (sketched after the options below), so both options can be regarded as batch. So which will be the best and most cost-effective option? Also, does Azure Event Hubs' 7-day retention matter in the architectural decision?

Option 1 (trigger once, every 24 hours):

azure event hub > databricks structured streaming > delta lake (bronze)

Option 2 (trigger once, every 24 hours):

azure event hub > event hub capture to Azure Data Lake gen 2 > Databricks Autoloader > delta lake (bronze)
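The trigger-once variant of Option 1 only changes the write trigger of the stream; a rough sketch, with hypothetical paths and `bronze` being the streaming DataFrame from the Option 1 sketch above:

```python
# `bronze` is the streaming DataFrame from the Option 1 sketch; only the trigger differs.
bronze_path = "abfss://bronze@<account>.dfs.core.windows.net/sensors"

(bronze.writeStream
 .format("delta")
 .option("checkpointLocation", bronze_path + "/_checkpoint")
 .trigger(once=True)   # process everything available since the last checkpoint, then stop
 .start(bronze_path))
```

A daily Databricks job can run this notebook, drain the backlog, and shut the cluster down, so neither option requires a 24/7 cluster.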

Ryan_Chynoweth
Honored Contributor III

I do think the 7-day retention should be considered. It may be a good idea to go with Option 1 for your data pipeline and use the trigger-once option, but I would also use Event Hubs Capture to archive all your data to a raw landing zone.

Hubert-Dudek
Esteemed Contributor III

If a batch job is possible and you only need to process data on demand, I would probably use:

azure event hub (events after the previous job run) > databricks job processes them as a dataframe > save the dataframe to delta lake

No streaming or Capture is needed in that case; a rough sketch is below.
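A minimal sketch of this batch pattern, assuming the azure-eventhubs-spark connector (which supports batch reads as well as streaming); the connection string, the 24-hour watermark, and the paths are hypothetical, and the exact starting-position fields should be checked against the connector's docs:

```python
import json
from datetime import datetime, timedelta

# Hypothetical placeholders: supply your own namespace, hub, and storage paths.
connection_string = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...;EntityPath=<hub>"

# Start from events enqueued after the previous job run (here: 24 hours ago).
start_time = (datetime.utcnow() - timedelta(hours=24)).strftime("%Y-%m-%dT%H:%M:%S.%fZ")

eh_conf = {
    # The connector expects the connection string encrypted with its helper.
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string),
    "eventhubs.startingPosition": json.dumps({
        "offset": None,
        "seqNo": -1,
        "enqueuedTime": start_time,
        "isInclusive": True,
    }),
}

# Plain batch read: no streaming query or checkpoint involved.
df = (spark.read
      .format("eventhubs")
      .options(**eh_conf)
      .load())

(df.withColumn("body", df["body"].cast("string"))
   .write
   .format("delta")
   .mode("append")
   .save("abfss://bronze@<account>.dfs.core.windows.net/sensors"))
```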

Kaniz
Community Manager

Hi @baatchus, just a friendly follow-up. Do you still need help, or did the responses above help you find the solution? Please let us know.
