cancel
Showing results for 
Search instead for 
Did you mean: 
Community Articles
Dive into a collaborative space where members like YOU can exchange knowledge, tips, and best practices. Join the conversation today and unlock a wealth of collective wisdom to enhance your experience and drive success.
cancel
Showing results for 
Search instead for 
Did you mean: 

Validating pointer-based Delta comparison architecture using flatMapGroupsWithState in Structured St

VamsiDatabricks
New Contributor

Hi everyone,

I’m leading an implementation where we’re comparing events from two real-time streams — a Source and a Target — in Databricks Structured Streaming (Scala).

Our goal is to identify and emit “delta” differences between corresponding records from both sides based on a common naturalId.

Here’s the high-level architecture we’ve designed:

  • Both Source and Target streams (from Kafka/Event Hubs) are read as structured streaming datasets.

  • Each event is parsed, hashed (SHA-256), and persisted as full JSON to Delta Lake (for durability, auditability, and replay).

  • Only lightweight metadata (key, hash, timestamp, Delta pointer) is kept in Spark state.

  • We use flatMapGroupsWithState with event-time timeout + watermarking to hold state per key until both sides arrive.

  • Once both Source and Target events for a given key are available, we fetch their corresponding JSONs from Delta using the stored pointers, perform the comparison, emit a DeltaRecord, and clear the state.

  • Late or missing events are automatically handled via watermark expiry, and deltas are written back to Delta for downstream consumption.

Here’s a simplified pseudocode snippet:

keyed.flatMapGroupsWithState[StateValue, DeltaRecord](
OutputMode.Append(),
GroupStateTimeout.EventTimeTimeout()
)(handleKey)

Could you please validate this approach for ,

  • Any hidden pitfalls in production especially around Delta I/O under load, event skew, or watermarking.

  • Whether others have adopted similar pointer-based approaches for large-scale streaming comparisons, and any tuning lessons learned.

 

Appreciate any feedback, design critiques, or optimization suggestions from those who’ve run this pattern at scale 🙏

Thanks,
Vamsi

#StructuredStreaming, #flatMapGroupsWithState, and #DeltaLake

 

0 REPLIES 0

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!

Sign Up Now