Hi everyone,
I’m leading an implementation where we’re comparing events from two real-time streams — a Source and a Target — in Databricks Structured Streaming (Scala).
Our goal is to identify and emit “delta” differences between corresponding records from both sides based on a common naturalId.
Here’s the high-level architecture we’ve designed:
Both Source and Target streams (from Kafka/Event Hubs) are read as structured streaming datasets.
Each event is parsed, hashed (SHA-256), and persisted as full JSON to Delta Lake (for durability, auditability, and replay).
Only lightweight metadata (key, hash, timestamp, Delta pointer) is kept in Spark state.
We use flatMapGroupsWithState with event-time timeout + watermarking to hold state per key until both sides arrive.
Once both Source and Target events for a given key are available, we fetch their corresponding JSONs from Delta using the stored pointers, perform the comparison, emit a DeltaRecord, and clear the state.
Late or missing events are automatically handled via watermark expiry, and deltas are written back to Delta for downstream consumption.
Here’s a simplified pseudocode snippet:
keyed.flatMapGroupsWithState[StateValue, DeltaRecord](
OutputMode.Append(),
GroupStateTimeout.EventTimeTimeout()
)(handleKey)
Could you please validate this approach for ,
Appreciate any feedback, design critiques, or optimization suggestions from those who’ve run this pattern at scale 🙏
Thanks,
Vamsi
#StructuredStreaming, #flatMapGroupsWithState, and #DeltaLake