<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Validating pointer-based Delta comparison architecture using flatMapGroupsWithState in Structured St in Community Articles</title>
    <link>https://community.databricks.com/t5/community-articles/validating-pointer-based-delta-comparison-architecture-using/m-p/134883#M728</link>
    <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I’m leading an implementation where we’re comparing events from two real-time streams — a &lt;STRONG&gt;Source&lt;/STRONG&gt; and a &lt;STRONG&gt;Target&lt;/STRONG&gt; — in Databricks Structured Streaming (Scala).&lt;/P&gt;&lt;P&gt;Our goal is to identify and emit “delta” differences between corresponding records from both sides based on a common naturalId.&lt;/P&gt;&lt;P&gt;Here’s the high-level architecture we’ve designed:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Both &lt;STRONG&gt;Source&lt;/STRONG&gt; and &lt;STRONG&gt;Target&lt;/STRONG&gt; streams (from Kafka/Event Hubs) are read as structured streaming datasets.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Each event is parsed, hashed (SHA-256), and persisted as full JSON to &lt;STRONG&gt;Delta Lake&lt;/STRONG&gt; (for durability, auditability, and replay).&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Only lightweight metadata (key, hash, timestamp, Delta pointer) is kept in Spark state.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;We use &lt;STRONG&gt;flatMapGroupsWithState&lt;/STRONG&gt; with &lt;STRONG&gt;event-time timeout + watermarking&lt;/STRONG&gt; to hold state per key until both sides arrive.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Once both Source and Target events for a given key are available, we fetch their corresponding JSONs from Delta using the stored pointers, perform the comparison, emit a DeltaRecord, and clear the state.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Late or missing events are automatically handled via watermark expiry, and deltas are written back to Delta for downstream consumption.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Here’s a simplified pseudocode snippet:&lt;BR /&gt;&lt;BR /&gt;keyed.flatMapGroupsWithState[StateValue, DeltaRecord](&lt;BR /&gt;OutputMode.Append(),&lt;BR /&gt;GroupStateTimeout.EventTimeTimeout()&lt;BR /&gt;)(handleKey)&lt;BR /&gt;&lt;BR /&gt;Could you please validate this approach for ,&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Any &lt;STRONG&gt;hidden pitfalls&lt;/STRONG&gt; in production especially around Delta I/O under load, event skew, or watermarking.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;Whether others have adopted similar &lt;STRONG&gt;pointer-based approaches&lt;/STRONG&gt; for large-scale streaming comparisons, and any tuning lessons learned.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Appreciate any feedback, design critiques, or optimization suggestions from those who’ve run this pattern at scale &lt;span class="lia-unicode-emoji" title=":folded_hands:"&gt;🙏&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Thanks,&lt;BR /&gt;&lt;EM&gt;Vamsi&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;#StructuredStreaming, #flatMapGroupsWithState, and #DeltaLake&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Tue, 14 Oct 2025 16:44:54 GMT</pubDate>
    <dc:creator>VamsiDatabricks</dc:creator>
    <dc:date>2025-10-14T16:44:54Z</dc:date>
    <item>
      <title>Validating pointer-based Delta comparison architecture using flatMapGroupsWithState in Structured St</title>
      <link>https://community.databricks.com/t5/community-articles/validating-pointer-based-delta-comparison-architecture-using/m-p/134883#M728</link>
      <description>&lt;P&gt;Hi everyone,&lt;/P&gt;&lt;P&gt;I’m leading an implementation where we’re comparing events from two real-time streams — a &lt;STRONG&gt;Source&lt;/STRONG&gt; and a &lt;STRONG&gt;Target&lt;/STRONG&gt; — in Databricks Structured Streaming (Scala).&lt;/P&gt;&lt;P&gt;Our goal is to identify and emit “delta” differences between corresponding records from both sides based on a common naturalId.&lt;/P&gt;&lt;P&gt;Here’s the high-level architecture we’ve designed:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Both &lt;STRONG&gt;Source&lt;/STRONG&gt; and &lt;STRONG&gt;Target&lt;/STRONG&gt; streams (from Kafka/Event Hubs) are read as structured streaming datasets.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Each event is parsed, hashed (SHA-256), and persisted as full JSON to &lt;STRONG&gt;Delta Lake&lt;/STRONG&gt; (for durability, auditability, and replay).&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Only lightweight metadata (key, hash, timestamp, Delta pointer) is kept in Spark state.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;We use &lt;STRONG&gt;flatMapGroupsWithState&lt;/STRONG&gt; with &lt;STRONG&gt;event-time timeout + watermarking&lt;/STRONG&gt; to hold state per key until both sides arrive.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Once both Source and Target events for a given key are available, we fetch their corresponding JSONs from Delta using the stored pointers, perform the comparison, emit a DeltaRecord, and clear the state.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Late or missing events are automatically handled via watermark expiry, and deltas are written back to Delta for downstream consumption.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Here’s a simplified pseudocode snippet:&lt;BR /&gt;&lt;BR /&gt;keyed.flatMapGroupsWithState[StateValue, DeltaRecord](&lt;BR /&gt;OutputMode.Append(),&lt;BR /&gt;GroupStateTimeout.EventTimeTimeout()&lt;BR /&gt;)(handleKey)&lt;BR /&gt;&lt;BR /&gt;Could you please validate this approach for ,&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Any &lt;STRONG&gt;hidden pitfalls&lt;/STRONG&gt; in production especially around Delta I/O under load, event skew, or watermarking.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;Whether others have adopted similar &lt;STRONG&gt;pointer-based approaches&lt;/STRONG&gt; for large-scale streaming comparisons, and any tuning lessons learned.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Appreciate any feedback, design critiques, or optimization suggestions from those who’ve run this pattern at scale &lt;span class="lia-unicode-emoji" title=":folded_hands:"&gt;🙏&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Thanks,&lt;BR /&gt;&lt;EM&gt;Vamsi&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;#StructuredStreaming, #flatMapGroupsWithState, and #DeltaLake&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 14 Oct 2025 16:44:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/community-articles/validating-pointer-based-delta-comparison-architecture-using/m-p/134883#M728</guid>
      <dc:creator>VamsiDatabricks</dc:creator>
      <dc:date>2025-10-14T16:44:54Z</dc:date>
    </item>
  </channel>
</rss>

