<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Streaming Live Table - What is actually computed? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/streaming-live-table-what-is-actually-computed/m-p/93371#M38685</link>
    <description>&lt;P&gt;Can anyone please share in a DLT or structured streaming task, what group of rows are computed?&lt;/P&gt;&lt;P&gt;Specific scenarios:&lt;/P&gt;&lt;P&gt;1. when a streaming table A joining a delta table B. Is each of the minibatches in A joining the whole delta table? Does Spark compute the joining from each minibatch with the whole table B?&lt;/P&gt;&lt;P&gt;2. when a streaming table A joining another streaming table B.&amp;nbsp; Does Spark compute the joining from only the new minibatch in A with minibatch in B?&amp;nbsp; or the whole table A is joining the whole table B?&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
    <pubDate>Thu, 10 Oct 2024 04:30:58 GMT</pubDate>
    <dc:creator>tliuzillow</dc:creator>
    <dc:date>2024-10-10T04:30:58Z</dc:date>
    <item>
      <title>Streaming Live Table - What is actually computed?</title>
      <link>https://community.databricks.com/t5/data-engineering/streaming-live-table-what-is-actually-computed/m-p/93371#M38685</link>
      <description>&lt;P&gt;Can anyone please share in a DLT or structured streaming task, what group of rows are computed?&lt;/P&gt;&lt;P&gt;Specific scenarios:&lt;/P&gt;&lt;P&gt;1. when a streaming table A joining a delta table B. Is each of the minibatches in A joining the whole delta table? Does Spark compute the joining from each minibatch with the whole table B?&lt;/P&gt;&lt;P&gt;2. when a streaming table A joining another streaming table B.&amp;nbsp; Does Spark compute the joining from only the new minibatch in A with minibatch in B?&amp;nbsp; or the whole table A is joining the whole table B?&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Thu, 10 Oct 2024 04:30:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/streaming-live-table-what-is-actually-computed/m-p/93371#M38685</guid>
      <dc:creator>tliuzillow</dc:creator>
      <dc:date>2024-10-10T04:30:58Z</dc:date>
    </item>
    <item>
      <title>Re: Streaming Live Table - What is actually computed?</title>
      <link>https://community.databricks.com/t5/data-engineering/streaming-live-table-what-is-actually-computed/m-p/93396#M38694</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/125810"&gt;@tliuzillow&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;&lt;BR /&gt;1. Stream-static Join: Each minibatch from the streaming table (A) is joined with the entire Delta table (B).&amp;nbsp;&lt;/P&gt;&lt;P&gt;2. Stream-stream Join: Each minibatch from the streaming table(A) is joined with minibatch from the streaming table(B).&amp;nbsp;&lt;/P&gt;&lt;P&gt;However, as &lt;A href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#joining-with-streaming-dataframesdatasets" target="_self"&gt;per documentation&lt;/A&gt; "&lt;EM&gt;t&lt;/EM&gt;&lt;SPAN&gt;&lt;EM&gt;he challenge of generating join results between two data streams is that, at any point of time, the view of the dataset is incomplete for both sides of the join making it much harder to find matches between inputs.&lt;/EM&gt;&amp;nbsp;" &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;This is why Spark can also keep the historical data in the buffer, which allows to match incoming data with past records, thus ensuring complete join results.&lt;BR /&gt;&lt;BR /&gt;To implement this, you will use watermarking.&amp;nbsp;Here is the code sample from the above documentation:&lt;BR /&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql.functions import expr

impressions = spark.readStream. ...
clicks = spark.readStream. ...

# Apply watermarks on event-time columns
impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")

# Join with event-time constraints
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime &amp;gt;= impressionTime AND
    clickTime &amp;lt;= impressionTime + interval 1 hour
    """)
)&lt;/LI-CODE&gt;&lt;P&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 10 Oct 2024 07:05:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/streaming-live-table-what-is-actually-computed/m-p/93396#M38694</guid>
      <dc:creator>filipniziol</dc:creator>
      <dc:date>2024-10-10T07:05:11Z</dc:date>
    </item>
  </channel>
</rss>

