Architecture choice, streaming data

baatchus — Thu, 20 Mar 2025 16:41:08 GMT

I have sensor data coming into Azure Event Hub and need some help in deciding how to best ingest it into the Data Lake and Delta Lake:

Option 1:

azure event hub > databricks structured streaming > delta lake (bronze)

Option 2:

azure event hub > event hub capture to Azure Data Lake Storage Gen2 > Databricks Auto Loader > delta lake (bronze)

There is no need for real-time processing; the data only needs to be processed on demand. Please give the reasons for choosing either option.

Re: Architecture choice, streaming data

Ryan_Chynoweth — Mon, 11 Oct 2021 20:18:10 GMT

Hi @baatchus, this is a great question. Option 1 is ideal if you require real-time processing of your data. Since you noted that you only need to process data on demand, I think Option 2 is the better choice for you.

Option 1 would require 24/7 processing (i.e., an always-on cluster), which costs more than you need. Since you can do batch processing, Option 2 would be more cost-effective. Event Hubs should allow you to write directly into ADLS through its Capture feature, without an intermediate tool.

If you ever do require stream processing, it wouldn't be difficult to switch between the two options.
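For illustration, Option 2 could look roughly like the sketch below. It assumes Event Hubs Capture is writing Avro files (its default format) to an ADLS path, and that the runtime supports schema inference for Avro in Auto Loader; otherwise a schema would need to be supplied with `.schema(...)`. The function names and all paths are hypothetical placeholders.

```python
def autoloader_options(schema_location: str) -> dict:
    """Auto Loader options for reading Event Hubs Capture output (Avro files)."""
    return {
        "cloudFiles.format": "avro",                   # Capture writes Avro by default
        "cloudFiles.schemaLocation": schema_location,  # where the inferred schema is tracked
    }


def ingest_captured_files(spark, source_path, target_path, checkpoint_path):
    """Pick up files that arrived since the last run, append them to bronze, then stop."""
    stream = (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(checkpoint_path + "/schema"))
        .load(source_path)  # e.g. the container/folder that Capture writes to
    )
    query = (
        stream.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)  # remembers which files were processed
        .trigger(once=True)  # process what is available, then shut down
        .start(target_path)
    )
    query.awaitTermination()
```

Because the checkpoint tracks processed files, the job can run on whatever schedule suits you and the cluster only needs to be up while it runs.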

Re: Architecture choice, streaming data

baatchus — Tue, 12 Oct 2021 09:16:09 GMT

@Ryan Chynoweth

Thanks for the reply.

Option 1 can be configured with trigger once, so both options can be regarded as batch. The question then becomes which option is best and most cost-effective. Also keep in mind that Azure Event Hub only has 7-day retention, in case that matters for the architectural decision.

Option 1 (Trigger once, every 24 hours)

azure event hub > databricks structured streaming > delta lake (bronze)

Option 2 (Trigger once, every 24 hours)

azure event hub > event hub capture to Azure Data Lake Storage Gen2 > Databricks Auto Loader > delta lake (bronze)
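As a rough sketch of Option 1 with trigger once: one way to read Event Hubs from Structured Streaming is through its Kafka-compatible endpoint (this assumes a tier that exposes that endpoint; the dedicated Event Hubs Spark connector is an alternative). The `kafkashaded.` prefix on the login module is what Databricks runtimes expect; on open-source Spark the class would be `org.apache.kafka.common.security.plain.PlainLoginModule`. Function names, namespace, hub name, and paths are hypothetical.

```python
def eventhub_kafka_options(namespace: str, event_hub: str, connection_string: str) -> dict:
    """Kafka source options pointing at an Event Hubs namespace (assumed setup)."""
    jaas = (
        "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="$ConnectionString" password="{connection_string}";'
    )
    return {
        "kafka.bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": jaas,
        "subscribe": event_hub,          # the event hub name acts as the Kafka topic
        "startingOffsets": "earliest",   # only used on the very first run
    }


def run_once(spark, options: dict, target_path: str, checkpoint_path: str):
    """Read everything new since the last checkpoint, append to bronze, then stop."""
    stream = spark.readStream.format("kafka").options(**options).load()
    query = (
        stream.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)  # stores the offsets already read
        .trigger(once=True)
        .start(target_path)
    )
    query.awaitTermination()
```

With this setup the checkpoint keeps track of offsets between runs, so a daily job only reads events that arrived since the previous run, provided they are still within the retention window.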

Re: Architecture choice, streaming data

Hubert-Dudek — Tue, 12 Oct 2021 13:26:51 GMT

If a batch job is possible and you need to process the data, I would probably use:

azure event hub (only events after the previous job run) > databricks job processes them as a dataframe > save df to delta lake

No streaming or capturing is needed in that case.
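A minimal sketch of that batch approach, assuming the event hub is read through its Kafka-compatible endpoint and that the job itself stores the next offset to read per partition between runs (where it stores them is left open here). All function names and the `kafka_options` dict of connection settings are hypothetical.

```python
import json


def starting_offsets_json(topic: str, partition_count: int, next_offsets: dict) -> str:
    """Build the startingOffsets value for a Kafka batch read.

    Partitions without a stored offset get -2, which means "earliest".
    """
    per_partition = {str(p): next_offsets.get(p, -2) for p in range(partition_count)}
    return json.dumps({topic: per_partition})


def read_new_events(spark, kafka_options: dict, topic: str, partition_count: int, next_offsets: dict):
    """Plain batch DataFrame with the events that arrived after the previous run."""
    return (
        spark.read.format("kafka")
        .options(**kafka_options)  # connection settings for the endpoint (assumed)
        .option("subscribe", topic)
        .option("startingOffsets", starting_offsets_json(topic, partition_count, next_offsets))
        .option("endingOffsets", "latest")
        .load()
    )


def offsets_for_next_run(df, previous: dict) -> dict:
    """Highest offset seen per partition plus one; startingOffsets is inclusive."""
    rows = df.groupBy("partition").max("offset").collect()
    seen = {row["partition"]: row["max(offset)"] + 1 for row in rows}
    return {**previous, **seen}  # partitions with no new events keep their old position
```

The job would then write the DataFrame to the Delta table and persist the result of `offsets_for_next_run`. This is the bookkeeping that a streaming checkpoint would otherwise do for you.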

Re: Architecture choice, streaming data

Ryan_Chynoweth — Tue, 12 Oct 2021 16:30:54 GMT

I do think that the 7-day retention should be considered. It may be a good idea to go with Option 1 for your data pipeline and use the trigger once option. But I would also use the Capture capability in Event Hubs to archive all your data to a raw landing zone.
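To make the retention point concrete, here is a small back-of-the-envelope sketch. It assumes, as a simplification, that events become unreadable exactly when they are older than the retention period; the function name is hypothetical.

```python
from datetime import timedelta


def expired_window(gap_between_runs: timedelta,
                   retention: timedelta = timedelta(days=7)) -> timedelta:
    """Span of events that has already expired when a job reads directly from Event Hubs.

    gap_between_runs is the time since the last successful run.
    """
    return max(gap_between_runs - retention, timedelta(0))


# A normal daily run loses nothing.
print(expired_window(timedelta(hours=24)))
# A job that has been failing for 9 days has lost the oldest 2 days of events.
print(expired_window(timedelta(days=9)))
```

With Capture archiving to a raw landing zone, those expired events would still be available as files even if the direct read missed them.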