I am designing a Structured Streaming job in Azure Databricks (using Scala) that will consume messages from two Event Hubs; let's call them source and target.
I would like your feedback on the flow below: whether it will survive production load, and any suggestions to make it better.
- We consume messages from the source and target Event Hubs.
- We parse each message to extract the id that identifies it uniquely, and store the actual message payload in a storage account.
- At this point we have one dataset each for source and target.
- We union the source and target datasets into a single dataset and group it by the unique key.
- On the keyed dataset we run the flatMapGroupsWithState operator so that the comparison runs every minute and checks whether both the source and target keys exist.
- Once both the source and target events for a given key are available, we fetch their corresponding JSONs from Delta using the stored pointers, perform the comparison, emit a DeltaRecord, and clear the state. (A rough wiring sketch follows the diagram below.)
This is the design diagram: [image: Structured Streaming]
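To make the wiring concrete, here is a rough sketch of the ingestion and keying steps. Names like parseAndStore, sourceHubOptions, and targetHubOptions are placeholders for my actual code, and the watermark duration is just an example:

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.col

case class Pointer(key: String, side: String, path: String, eventTime: java.sql.Timestamp)

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Placeholder: extracts the unique id from the raw JSON and persists the
// payload to Delta, returning the stored path (the "pointer").
def parseAndStore(json: String): (String, String) = ???

def readHub(options: Map[String, String], side: String): Dataset[Pointer] =
  spark.readStream
    .format("eventhubs")                 // azure-eventhubs-spark connector
    .options(options)                    // connection config (placeholder)
    .load()
    .select(col("body").cast("string"), col("enqueuedTime"))
    .as[(String, java.sql.Timestamp)]
    .map { case (body, enqueuedTime) =>
      val (key, path) = parseAndStore(body)
      Pointer(key, side, path, enqueuedTime)
    }

val keyed = readHub(sourceHubOptions, "source")
  .union(readHub(targetHubOptions, "target"))
  .withWatermark("eventTime", "10 minutes") // a watermark is required for EventTimeTimeout
  .groupByKey(_.key)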
Here’s a simplified pseudocode snippet:
keyed.flatMapGroupsWithState[StateValue, DeltaRecord](
  OutputMode.Append(),
  GroupStateTimeout.EventTimeTimeout()
)(handleKey)
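The state handler I have in mind looks roughly like this. StateValue and DeltaRecord are simplified, and loadJson/compareJsons are placeholders for the Delta fetch and comparison logic:

import org.apache.spark.sql.streaming.GroupState

case class StateValue(sourcePath: Option[String], targetPath: Option[String])
case class DeltaRecord(key: String, diff: String)

// Placeholders for fetching a stored JSON by pointer and diffing two JSONs.
def loadJson(path: String): String = ???
def compareJsons(source: String, target: String): String = ???

def handleKey(
    key: String,
    events: Iterator[Pointer],
    state: GroupState[StateValue]): Iterator[DeltaRecord] = {
  // Merge any newly arrived pointers into the state.
  var st = state.getOption.getOrElse(StateValue(None, None))
  events.foreach { p =>
    st = if (p.side == "source") st.copy(sourcePath = Some(p.path))
         else st.copy(targetPath = Some(p.path))
  }

  (st.sourcePath, st.targetPath) match {
    case (Some(src), Some(tgt)) =>
      // Both sides are available: fetch the JSONs via the pointers,
      // compare, emit a DeltaRecord, and clear the state.
      val diff = compareJsons(loadJson(src), loadJson(tgt))
      state.remove()
      Iterator.single(DeltaRecord(key, diff))
    case _ if state.hasTimedOut =>
      // Only one side arrived before the timeout; expire the state
      // (a "missing counterpart" record could be emitted here instead).
      state.remove()
      Iterator.empty
    case _ =>
      // Still waiting for the other side: keep the state and re-arm a
      // timeout one minute past the current watermark.
      state.update(st)
      state.setTimeoutTimestamp(state.getCurrentWatermarkMs + 60000L)
      Iterator.empty
  }
}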
Could you please validate this approach? Specifically:
- Any hidden pitfalls in production, especially around Delta I/O under load, event skew, or watermarking?
- Whether others have adopted similar pointer-based approaches for large-scale streaming comparisons, and any tuning lessons learned?
I'd appreciate any feedback, design critiques, or optimization suggestions from those who've run this pattern at scale.