10-11-2021 06:23 AM
I have sensor data coming into Azure Event Hub and need some help in deciding how to best ingest it into the Data Lake and Delta Lake:
Option 1:
azure event hub > databricks structured streaming > delta lake (bronze)
Option 2:
azure event hub > event hub capture to Azure Data Lake gen 2 > Databricks Autoloader > delta lake (bronze)
No need for real-time; we only need to process the data when needed. Please state the reasons for choosing either option.
10-11-2021 01:18 PM
Hi @baatchus, this is a great question. Option 1 is ideal if you require real-time processing of your data. Since you noted that you only need to process data when needed, I think Option 2 is the better choice for you.
Option 1 would require 24/7 processing (i.e. a 24/7 cluster), which is more costly than you need. Since you can do batch processing, Option 2 would be more cost effective. Event Hubs Capture lets you land data directly in ADLS without an intermediate tool.
If you ever do require stream processing, it wouldn't be difficult to switch between your two options.
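For illustration, a minimal sketch of the Auto Loader step in Option 2 might look like the following (the container names and paths are hypothetical; Event Hubs Capture writes Avro files, which Auto Loader picks up incrementally):

```python
# Minimal Auto Loader sketch for Option 2 (all paths are hypothetical).
# Event Hubs Capture lands Avro files in ADLS Gen2; Auto Loader discovers new files
# incrementally and appends them to a bronze Delta table on each on-demand run.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

capture_path = "abfss://landing@mydatalake.dfs.core.windows.net/eventhub-capture/"  # hypothetical
bronze_path = "abfss://bronze@mydatalake.dfs.core.windows.net/sensor_events/"       # hypothetical
checkpoint_path = "abfss://bronze@mydatalake.dfs.core.windows.net/_checkpoints/sensor_events/"

df = (
    spark.readStream
    .format("cloudFiles")                                  # Auto Loader
    .option("cloudFiles.format", "avro")                   # Capture emits Avro by default
    .option("cloudFiles.schemaLocation", checkpoint_path)  # enables schema inference/evolution
    .load(capture_path)
)

(
    df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .trigger(once=True)  # run only when needed, then stop
    .start(bronze_path)
)
```

Because the captured files stay in the lake, the bronze load is not constrained by the Event Hubs retention window.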
10-11-2021 07:47 AM
Hi @baatchus! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first; otherwise, I will get back to you soon. Thanks.
10-12-2021 02:16 AM
@Ryan Chynoweth
thanks for the reply.
Option 1 can be configured with a trigger-once run, so both options can be regarded as batch. So I still need to decide which will be the best and most cost-effective option. Also keep in mind that Azure Event Hubs only has 7-day retention, if that matters in the architectural decision.
Option 1 (Trigger once, every 24 hours)
azure event hub > databricks structured streaming > delta lake (bronze)
Option 2 (Trigger once, every 24 hours)
azure event hub > event hub capture to Azure Data Lake gen 2 > Databricks Autoloader > delta lake (bronze)
10-12-2021 09:30 AM
I do think that the 7-day retention should be considered. It may be a good idea to go with Option 1 for your data pipeline and use the trigger-once option, but I would also use Event Hubs Capture to archive all your data to a raw landing zone.
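For illustration, a minimal sketch of Option 1 run as a trigger-once job might look like this (the secret scope, paths, and table layout are hypothetical; it assumes the azure-eventhubs-spark connector is installed on the cluster):

```python
# Minimal trigger-once sketch for Option 1 (all names and paths are hypothetical).
# Requires the azure-eventhubs-spark connector; the connection string is encrypted
# with the connector's EventHubsUtils helper before being passed as an option.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

conn_str = dbutils.secrets.get("my-scope", "eventhub-connection-string")  # hypothetical secret scope
eh_conf = {
    "eventhubs.connectionString":
        spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str)
}

raw = spark.readStream.format("eventhubs").options(**eh_conf).load()

(
    raw.writeStream
    .format("delta")
    .option("checkpointLocation",
            "abfss://bronze@mydatalake.dfs.core.windows.net/_chk/sensor_events/")  # hypothetical
    .trigger(once=True)  # drain everything new since the last checkpoint, then stop
    .start("abfss://bronze@mydatalake.dfs.core.windows.net/sensor_events/")        # hypothetical
)
```

The checkpoint tracks where the last run stopped, so each scheduled run only reads events that arrived since then; as long as the job runs at least once within the 7-day retention window, no events are lost.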
10-12-2021 06:26 AM
If a batch job is possible and you only need to process data on demand, I would probably use:
azure event hub (events since the previous job run) > databricks job processing the data as a DataFrame > save the DataFrame to delta lake
No streaming or capture is needed in that case.
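A minimal sketch of that approach might look like this (the secret scope, paths, and the way the last-run timestamp is stored are all assumptions; it uses the azure-eventhubs-spark connector's batch read with a starting position based on enqueued time):

```python
# Minimal batch-read sketch (all names and paths are hypothetical).
# Reads only events enqueued since the previous run, processes them as a DataFrame,
# and appends the result to Delta. The last-run timestamp would need to be persisted
# somewhere between runs (e.g. a small control table) - that part is not shown here.
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

conn_str = dbutils.secrets.get("my-scope", "eventhub-connection-string")  # hypothetical secret scope
last_run_time = "2021-10-11T00:00:00.000Z"                                # loaded from your own run log

eh_conf = {
    "eventhubs.connectionString":
        spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str),
    "eventhubs.startingPosition": json.dumps({
        "offset": None,
        "seqNo": -1,
        "enqueuedTime": last_run_time,  # start from the previous run's watermark
        "isInclusive": True,
    }),
}

df = spark.read.format("eventhubs").options(**eh_conf).load()

(
    df.selectExpr("cast(body as string) as body", "enqueuedTime")
    .write.format("delta")
    .mode("append")
    .save("abfss://bronze@mydatalake.dfs.core.windows.net/sensor_events/")  # hypothetical path
)
```

Compared with the streaming options, this keeps everything in one batch job, but you have to manage the watermark yourself rather than relying on a streaming checkpoint.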
05-18-2022 02:38 PM
Hi @baatchus, just a friendly follow-up. Do you still need help, or did the above responses help you find a solution? Please let us know.