<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Update destination table when using Spark Structured Streaming and Delta tables in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/update-destination-table-when-using-spark-structured-streaming/m-p/41403#M27352</link>
    <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/28727"&gt;@Mo&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I was under the impression that stream-stream joins couldn't be performed when using "foreachBatch".&lt;BR /&gt;Is this not the case?&lt;/P&gt;&lt;P&gt;Also, this part of the &lt;A href="https://spark.apache.org/docs/3.2.2/structured-streaming-programming-guide.html#stream-stream-joins:~:text=Additional%20details%20on,mode%20before%20joins." target="_self"&gt;documentation&lt;/A&gt; specifies that joins can only be used with "append" mode.&lt;BR /&gt;So it seems like it won't work with the merge approach&amp;nbsp;&lt;span class="lia-unicode-emoji" title=":confused_face:"&gt;😕&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 24 Aug 2023 20:58:06 GMT</pubDate>
    <dc:creator>Agus1</dc:creator>
    <dc:date>2023-08-24T20:58:06Z</dc:date>
    <item>
      <title>Update destination table when using Spark Structured Streaming and Delta tables</title>
      <link>https://community.databricks.com/t5/data-engineering/update-destination-table-when-using-spark-structured-streaming/m-p/41258#M27319</link>
      <description>&lt;P&gt;I’m trying to implement a streaming pipeline that will run hourly using Spark Structured Streaming, Scala and Delta tables. The pipeline will process different items with their details.&lt;/P&gt;&lt;P&gt;The sources are Delta tables that already exist, written hourly using the&lt;SPAN&gt;&amp;nbsp;"&lt;/SPAN&gt;writeStream"&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;command. The pipeline should take data from the source table, perform some transformations, and write the result to another Delta table (the destination table).&lt;/P&gt;&lt;P&gt;The problem I’m facing is that, at different moments in time, the source table will bring new versions of items that were processed in the past (these are not duplicated messages, just an updated version of the same item). In these cases I need to update the item in the destination table in order to keep only the latest version.&lt;BR /&gt;Additionally, in some cases I need to use two streaming tables as sources and join them, which blocks me from using "foreachBatch".&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;According to&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://docs.databricks.com/en/structured-streaming/delta-lake.html#:%7E:text=Structured%20Streaming%20does%20not%20handle%20input%20that%20is%20not%20an%20append" target="_blank" rel="nofollow noopener noreferrer"&gt;this&lt;/A&gt;, Structured Streaming can only be used in “append” mode, but for my use case I would need to update the data when writing.&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Is there a way to make this work?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I feel that this should be a pretty common scenario that many streaming implementations will have to face at some point, but so far I haven’t been able to find a way around it or any published solutions.&lt;/P&gt;</description>
      <pubDate>Thu, 24 Aug 2023 13:10:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/update-destination-table-when-using-spark-structured-streaming/m-p/41258#M27319</guid>
      <dc:creator>Agus1</dc:creator>
      <dc:date>2023-08-24T13:10:54Z</dc:date>
    </item>
    <item>
      <title>Re: Update destination table when using Spark Structured Streaming and Delta tables</title>
      <link>https://community.databricks.com/t5/data-engineering/update-destination-table-when-using-spark-structured-streaming/m-p/41335#M27332</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/28727"&gt;@Mo&lt;/a&gt;, thanks for the answer!&lt;/P&gt;&lt;P&gt;I've considered using foreachBatch, but there are two issues:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;In some cases I need to use two streaming tables as sources and join them. I believe this is not supported by foreachBatch?&lt;/LI&gt;&lt;LI&gt;Since I would be using merge for writing the output, would it be possible to stream &lt;STRONG&gt;from&lt;/STRONG&gt; the output table?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Do you know if there is a way to resolve these issues?&lt;/P&gt;</description>
      <pubDate>Thu, 24 Aug 2023 13:05:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/update-destination-table-when-using-spark-structured-streaming/m-p/41335#M27332</guid>
      <dc:creator>Agus1</dc:creator>
      <dc:date>2023-08-24T13:05:36Z</dc:date>
    </item>
    <item>
      <title>Re: Update destination table when using Spark Structured Streaming and Delta tables</title>
      <link>https://community.databricks.com/t5/data-engineering/update-destination-table-when-using-spark-structured-streaming/m-p/41403#M27352</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/28727"&gt;@Mo&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I was under the impression that stream-stream joins couldn't be performed when using "foreachBatch".&lt;BR /&gt;Is this not the case?&lt;/P&gt;&lt;P&gt;Also, this part of the &lt;A href="https://spark.apache.org/docs/3.2.2/structured-streaming-programming-guide.html#stream-stream-joins:~:text=Additional%20details%20on,mode%20before%20joins." target="_self"&gt;documentation&lt;/A&gt; specifies that joins can only be used with "append" mode.&lt;BR /&gt;So it seems like it won't work with the merge approach&amp;nbsp;&lt;span class="lia-unicode-emoji" title=":confused_face:"&gt;😕&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 24 Aug 2023 20:58:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/update-destination-table-when-using-spark-structured-streaming/m-p/41403#M27352</guid>
      <dc:creator>Agus1</dc:creator>
      <dc:date>2023-08-24T20:58:06Z</dc:date>
    </item>
    <item>
      <title>Re: Update destination table when using Spark Structured Streaming and Delta tables</title>
      <link>https://community.databricks.com/t5/data-engineering/update-destination-table-when-using-spark-structured-streaming/m-p/41407#M27353</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/87255"&gt;@Agus1&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Could you try using CDC in Delta? You could use readChangeFeed to read only the changes that were applied to the source table. This is also explained here:&lt;/P&gt;&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/delta/delta-change-data-feed" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/delta/delta-change-data-feed&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 25 Aug 2023 04:30:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/update-destination-table-when-using-spark-structured-streaming/m-p/41407#M27353</guid>
      <dc:creator>Tharun-Kumar</dc:creator>
      <dc:date>2023-08-25T04:30:43Z</dc:date>
    </item>
  </channel>
</rss>

