I am designing a Structured Streaming job in Azure Databricks (using Scala) that will consume messages from two Event Hubs; let's call them source and target.
I would like your feedback on the flow below: whether it will survive production load, and any suggestions to make it better.
- We consume messages from the source and target Event Hubs.
- We parse each message to extract the id that identifies it uniquely, and store the actual message payload in a storage account.
- At this point we have one dataset each for source and target.
- We union the source and target datasets into a single dataset and group it by the unique key.
- On the keyed dataset we run the flatMapGroupsWithState operator so that the comparison runs every minute and checks whether both the source and target keys exist.
- Once both the source and target events for a given key are available, we fetch their corresponding JSONs from Delta using the stored pointers, perform the comparison, emit a DeltaRecord, and clear the state. (A rough wiring sketch follows the diagram below.)
This is the design diagram: [image: Structured Streaming]
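To make the wiring concrete, here is a rough sketch of the ingestion and keying steps. Names like parseAndStore, sourceHubOptions, and targetHubOptions are placeholders for my actual code, and the watermark duration is just an example:

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.col

case class Pointer(key: String, side: String, path: String, eventTime: java.sql.Timestamp)

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Placeholder: extracts the unique id from the raw JSON and persists the
// payload to Delta, returning the stored path (the "pointer").
def parseAndStore(json: String): (String, String) = ???

def readHub(options: Map[String, String], side: String): Dataset[Pointer] =
  spark.readStream
    .format("eventhubs")                 // azure-eventhubs-spark connector
    .options(options)                    // connection config (placeholder)
    .load()
    .select(col("body").cast("string"), col("enqueuedTime"))
    .as[(String, java.sql.Timestamp)]
    .map { case (body, enqueuedTime) =>
      val (key, path) = parseAndStore(body)
      Pointer(key, side, path, enqueuedTime)
    }

val keyed = readHub(sourceHubOptions, "source")
  .union(readHub(targetHubOptions, "target"))
  .withWatermark("eventTime", "10 minutes") // a watermark is required for EventTimeTimeout
  .groupByKey(_.key)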
Here’s a simplified pseudocode snippet:
keyed.flatMapGroupsWithState[StateValue, DeltaRecord](
  OutputMode.Append(),
  GroupStateTimeout.EventTimeTimeout()
)(handleKey)
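The state handler I have in mind looks roughly like this. StateValue and DeltaRecord are simplified, and loadJson/compareJsons are placeholders for the Delta fetch and comparison logic:

import org.apache.spark.sql.streaming.GroupState

case class StateValue(sourcePath: Option[String], targetPath: Option[String])
case class DeltaRecord(key: String, diff: String)

// Placeholders for fetching a stored JSON by pointer and diffing two JSONs.
def loadJson(path: String): String = ???
def compareJsons(source: String, target: String): String = ???

def handleKey(
    key: String,
    events: Iterator[Pointer],
    state: GroupState[StateValue]): Iterator[DeltaRecord] = {
  // Merge any newly arrived pointers into the state.
  var st = state.getOption.getOrElse(StateValue(None, None))
  events.foreach { p =>
    st = if (p.side == "source") st.copy(sourcePath = Some(p.path))
         else st.copy(targetPath = Some(p.path))
  }

  (st.sourcePath, st.targetPath) match {
    case (Some(src), Some(tgt)) =>
      // Both sides are available: fetch the JSONs via the pointers,
      // compare, emit a DeltaRecord, and clear the state.
      val diff = compareJsons(loadJson(src), loadJson(tgt))
      state.remove()
      Iterator.single(DeltaRecord(key, diff))
    case _ if state.hasTimedOut =>
      // Only one side arrived before the timeout; expire the state
      // (a "missing counterpart" record could be emitted here instead).
      state.remove()
      Iterator.empty
    case _ =>
      // Still waiting for the other side: keep the state and re-arm a
      // timeout one minute past the current watermark.
      state.update(st)
      state.setTimeoutTimestamp(state.getCurrentWatermarkMs + 60000L)
      Iterator.empty
  }
}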
Could you please validate this approach? Specifically:
- Any hidden pitfalls in production, especially around Delta I/O under load, event skew, or watermarking?
- Whether others have adopted similar pointer-based approaches for large-scale streaming comparisons, and any tuning lessons learned?
I'd appreciate any feedback, design critiques, or optimization suggestions from those who've run this pattern at scale.