Databricks Community

VamsiDatabricks · ‎10-14-2025

Hi everyone,

I’m leading an implementation where we’re comparing events from two real-time streams — a Source and a Target — in Databricks Structured Streaming (Scala).

Our goal is to identify and emit “delta” differences between corresponding records from both sides based on a common naturalId.

Here’s the high-level architecture we’ve designed:

Both Source and Target streams (from Kafka/Event Hubs) are read as structured streaming datasets.
Each event is parsed, hashed (SHA-256), and persisted as full JSON to Delta Lake (for durability, auditability, and replay).
Only lightweight metadata (key, hash, timestamp, Delta pointer) is kept in Spark state.
We use flatMapGroupsWithState with event-time timeout + watermarking to hold state per key until both sides arrive.
Once both Source and Target events for a given key are available, we fetch their corresponding JSONs from Delta using the stored pointers, perform the comparison, emit a DeltaRecord, and clear the state.
Late or missing events are automatically handled via watermark expiry, and deltas are written back to Delta for downstream consumption.
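To make "lightweight metadata" concrete, here is a minimal sketch of the hashing step and the per-event entry kept in state. The names `DeltaPointer`, `StateEntry`, and `Hashing`, and all field names, are illustrative assumptions rather than our actual schema.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Hypothetical shapes for illustration; field names are assumptions.
final case class DeltaPointer(tablePath: String, ingestDate: String, eventId: String)
final case class StateEntry(
    naturalId: String,
    hash: String,
    eventTimeMs: Long,
    pointer: DeltaPointer)

object Hashing {
  // SHA-256 over the UTF-8 bytes of the raw JSON, rendered as lowercase hex.
  def sha256Hex(json: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(json.getBytes(StandardCharsets.UTF_8))
      .map(b => "%02x".format(b))
      .mkString
}
```

Note that a hash over the raw JSON string is sensitive to key order and whitespace, so two semantically equal payloads can produce different hashes unless the JSON is canonicalized first.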

Here’s a simplified pseudocode snippet:

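// Per-key stateful match: handleKey receives (naturalId, events, GroupState[StateValue])
keyed.flatMapGroupsWithState[StateValue, DeltaRecord](
  OutputMode.Append(),                  // each DeltaRecord is emitted once
  GroupStateTimeout.EventTimeTimeout()  // state expires once the watermark passes the timeout
)(handleKey)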
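For context, a minimal sketch of what `handleKey` could look like. The field layouts of `EventMeta`, `StateValue`, and `DeltaRecord`, the `side` marker, and the 10-minute `HoldMs` window are all assumptions for illustration. This sketch compares hashes only and emits the pointers; the Delta fetch and field-level diff are deliberately left out.

```scala
import org.apache.spark.sql.streaming.GroupState

// Hypothetical shapes: the snippet names StateValue and DeltaRecord but not their fields.
final case class EventMeta(naturalId: String, side: String, hash: String,
                           eventTimeMs: Long, pointer: String)
final case class StateValue(source: Option[EventMeta] = None,
                            target: Option[EventMeta] = None)
final case class DeltaRecord(naturalId: String, status: String,
                             sourcePointer: Option[String],
                             targetPointer: Option[String])

object Matcher {
  // Assumed hold window: how long a one-sided key waits for its counterpart.
  val HoldMs: Long = 10 * 60 * 1000L

  // Pure merge step, free of Spark types so it can be unit-tested directly.
  // If a side arrives more than once, the last event in arrival order wins.
  def merge(prior: StateValue, events: Iterator[EventMeta]): StateValue =
    events.foldLeft(prior) { (acc, e) =>
      if (e.side == "source") acc.copy(source = Some(e)) else acc.copy(target = Some(e))
    }

  // Hash-only comparison; a one-sided state is reported as UNMATCHED.
  def toRecord(naturalId: String, s: StateValue): DeltaRecord = {
    val status = (s.source, s.target) match {
      case (Some(a), Some(b)) => if (a.hash == b.hash) "IDENTICAL" else "DIFFERENT"
      case _                  => "UNMATCHED"
    }
    DeltaRecord(naturalId, status, s.source.map(_.pointer), s.target.map(_.pointer))
  }

  def handleKey(naturalId: String, events: Iterator[EventMeta],
                state: GroupState[StateValue]): Iterator[DeltaRecord] =
    if (state.hasTimedOut) {
      // The watermark passed the timeout: report the one-sided key, then clear it.
      val expired = state.get
      state.remove()
      Iterator(toRecord(naturalId, expired))
    } else {
      val merged = merge(state.getOption.getOrElse(StateValue()), events)
      if (merged.source.isDefined && merged.target.isDefined) {
        // Both sides present: emit and drop any state held for this key.
        if (state.exists) state.remove()
        Iterator(toRecord(naturalId, merged))
      } else {
        state.update(merged)
        // The timeout must not be earlier than the current watermark, or Spark rejects it.
        val latest = Seq(merged.source, merged.target).flatten.map(_.eventTimeMs).max
        state.setTimeoutTimestamp(latest + HoldMs)
        Iterator.empty
      }
    }
}
```

With this shape, the call above would pass `Matcher.handleKey`, and the expensive Delta reads could happen downstream on the emitted pointers instead of inside the state function.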

Could you please validate this approach? In particular, I'd like to know:

Are there any hidden pitfalls in production, especially around Delta I/O under load, event skew, or watermarking?
Have others adopted similar pointer-based approaches for large-scale streaming comparisons, and what tuning lessons did you learn?

Appreciate any feedback, design critiques, or optimization suggestions from those who’ve run this pattern at scale 🙏

Thanks,
Vamsi

#StructuredStreaming, #flatMapGroupsWithState, and #DeltaLake

