<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Streaming problems after Vacuum in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/streaming-problems-after-vaccum/m-p/53771#M29878</link>
    <description>&lt;P&gt;Thanks a lot for the details. One point I still don't get is the difference between these two points (and let's forget VACUUM for this):&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;EM&gt;If your streaming query ran 2 weeks ago, it will not reprocess all table versions since then.&lt;/EM&gt;&lt;/LI&gt;&lt;LI&gt;&lt;EM&gt;Instead, it will process only the new records introduced during that time period.&lt;/EM&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Let's say my source Delta table is at version 2500. I execute a streaming job once with &lt;SPAN&gt;availableNow=True, so it loads everything up to table version 2500.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Now, for two weeks, I insert, delete, and update data in this source table. After 2 weeks, I'm at version 2750. Now I execute the streaming job again.&lt;/P&gt;&lt;P&gt;I don't understand the difference: isn't everything between versions 2500 and 2750 exactly what has changed? Does the second bullet point mean that it only processes inserts, but not deletes and updates?&lt;/P&gt;&lt;P&gt;Thanks for clarifying.&lt;/P&gt;</description>
    <pubDate>Fri, 24 Nov 2023 19:14:44 GMT</pubDate>
    <dc:creator>pgruetter</dc:creator>
    <dc:date>2023-11-24T19:14:44Z</dc:date>
    <item>
      <title>Streaming problems after Vacuum</title>
      <link>https://community.databricks.com/t5/data-engineering/streaming-problems-after-vaccum/m-p/51040#M28940</link>
      <description>&lt;P&gt;Hi all&lt;/P&gt;&lt;P&gt;To read from a large Delta table, I'm using readStream with trigger(availableNow=True), as I only want to run it daily. This worked well for the initial load and for the incremental loads after that.&lt;/P&gt;&lt;P&gt;At some point, though, I received an error from the source Delta table saying that a Parquet file referenced by the index is no longer available.&lt;/P&gt;&lt;P&gt;I know that a VACUUM command is periodically issued against the source table, but with the default retention of 7 days.&lt;BR /&gt;My incremental load was not executed for 2 weeks. Could that be the problem?&lt;/P&gt;&lt;P&gt;How exactly does readStream work: if it last ran 2 weeks ago, will it try to read all table versions since then? That could explain the error, as it would reference Parquet files older than 7 days.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Mon, 13 Nov 2023 11:50:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/streaming-problems-after-vaccum/m-p/51040#M28940</guid>
      <dc:creator>pgruetter</dc:creator>
      <dc:date>2023-11-13T11:50:53Z</dc:date>
    </item>
    <item>
      <title>Re: Streaming problems after Vacuum</title>
      <link>https://community.databricks.com/t5/data-engineering/streaming-problems-after-vaccum/m-p/53771#M29878</link>
      <description>&lt;P&gt;Thanks a lot for the details. One point I still don't get is the difference between these two points (and let's forget VACUUM for this):&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;EM&gt;If your streaming query ran 2 weeks ago, it will not reprocess all table versions since then.&lt;/EM&gt;&lt;/LI&gt;&lt;LI&gt;&lt;EM&gt;Instead, it will process only the new records introduced during that time period.&lt;/EM&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Let's say my source Delta table is at version 2500. I execute a streaming job once with &lt;SPAN&gt;availableNow=True, so it loads everything up to table version 2500.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Now, for two weeks, I insert, delete, and update data in this source table. After 2 weeks, I'm at version 2750. Now I execute the streaming job again.&lt;/P&gt;&lt;P&gt;I don't understand the difference: isn't everything between versions 2500 and 2750 exactly what has changed? Does the second bullet point mean that it only processes inserts, but not deletes and updates?&lt;/P&gt;&lt;P&gt;Thanks for clarifying.&lt;/P&gt;</description>
      <pubDate>Fri, 24 Nov 2023 19:14:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/streaming-problems-after-vaccum/m-p/53771#M29878</guid>
      <dc:creator>pgruetter</dc:creator>
      <dc:date>2023-11-24T19:14:44Z</dc:date>
    </item>
  </channel>
</rss>

