<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Autoloader: how to avoid overlap in files in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/autoloader-how-to-avoid-overlap-in-files/m-p/23954#M16619</link>
    <description>&lt;P&gt;I'm thinking of using Auto Loader to process files being put on our data lake.&lt;/P&gt;&lt;P&gt;Let's say, for example, that a Parquet file is written every 15 minutes.  These files, however, contain &lt;B&gt;overlapping data&lt;/B&gt;.&lt;/P&gt;&lt;P&gt;Now, every 2 hours I want to process the new data (Auto Loader) and merge it into a Delta Lake table.&lt;/P&gt;&lt;P&gt;This seems pretty trivial, but unfortunately it is not: when Auto Loader fetches the new data, the streaming query will contain two types of duplicate data: actual duplicates (which can be dropped with dropDuplicates), but also different versions of the same record (a record can be updated multiple times during a period of time).  I want to process only the most recent version (based on a change-date column).&lt;/P&gt;&lt;P&gt;For this last part, I don't see how I can fix it with a streaming query.&lt;/P&gt;&lt;P&gt;For batch, I would use a window function that partitions by the semantic key (id) and sorts on a timestamp, but for streaming this is not possible.&lt;/P&gt;&lt;P&gt;So, any ideas?  Basically, it is the 'Spark Streaming keep most recent record in group' problem.&lt;/P&gt;</description>
    <pubDate>Thu, 03 Nov 2022 14:02:30 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2022-11-03T14:02:30Z</dc:date>
    <item>
      <title>Autoloader: how to avoid overlap in files</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-how-to-avoid-overlap-in-files/m-p/23954#M16619</link>
      <description>&lt;P&gt;I'm thinking of using Auto Loader to process files being put on our data lake.&lt;/P&gt;&lt;P&gt;Let's say, for example, that a Parquet file is written every 15 minutes.  These files, however, contain &lt;B&gt;overlapping data&lt;/B&gt;.&lt;/P&gt;&lt;P&gt;Now, every 2 hours I want to process the new data (Auto Loader) and merge it into a Delta Lake table.&lt;/P&gt;&lt;P&gt;This seems pretty trivial, but unfortunately it is not: when Auto Loader fetches the new data, the streaming query will contain two types of duplicate data: actual duplicates (which can be dropped with dropDuplicates), but also different versions of the same record (a record can be updated multiple times during a period of time).  I want to process only the most recent version (based on a change-date column).&lt;/P&gt;&lt;P&gt;For this last part, I don't see how I can fix it with a streaming query.&lt;/P&gt;&lt;P&gt;For batch, I would use a window function that partitions by the semantic key (id) and sorts on a timestamp, but for streaming this is not possible.&lt;/P&gt;&lt;P&gt;So, any ideas?  Basically, it is the 'Spark Streaming keep most recent record in group' problem.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Nov 2022 14:02:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-how-to-avoid-overlap-in-files/m-p/23954#M16619</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-11-03T14:02:30Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader: how to avoid overlap in files</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-how-to-avoid-overlap-in-files/m-p/23955#M16620</link>
      <description>&lt;P&gt;What about foreachBatch and then MERGE?&lt;/P&gt;&lt;P&gt;Alternatively, run another process that cleans up the updates using the window function, as you said.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Nov 2022 14:21:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-how-to-avoid-overlap-in-files/m-p/23955#M16620</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-11-03T14:21:29Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader: how to avoid overlap in files</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-how-to-avoid-overlap-in-files/m-p/23956#M16621</link>
      <description>&lt;P&gt;foreachBatch is an option, but then the merge will take a long time (a merge per file).&lt;/P&gt;&lt;P&gt;Also (I forgot to mention this): a single file can also contain multiple versions of a single record.&lt;/P&gt;&lt;P&gt;Not using Auto Loader seems the way to go at the moment, but it would be nice if it were possible after all without a lot of overhead.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Nov 2022 14:28:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-how-to-avoid-overlap-in-files/m-p/23956#M16621</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-11-03T14:28:59Z</dc:date>
    </item>
  </channel>
</rss>
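
The foreachBatch + MERGE approach suggested in the thread can be sketched as below. The key point is that inside foreachBatch the micro-batch is a *static* DataFrame, so the window-function dedup the original poster wanted (keep only the newest version per key, even when one file holds several versions of the same record) is available there, and only one MERGE runs per micro-batch rather than one per file. This is a minimal sketch, not a definitive implementation: the column names `id` and `change_date`, the table name `my_schema.target`, and the paths are assumed placeholders, and the `cloudFiles` source requires a Databricks runtime with Delta Lake, so it is not runnable standalone.

```python
from pyspark.sql import functions as F, Window
from delta.tables import DeltaTable

def upsert_latest(batch_df, batch_id):
    # Micro-batch DataFrames are static, so a window function is allowed here:
    # keep only the most recent version of each record, per semantic key.
    w = Window.partitionBy("id").orderBy(F.col("change_date").desc())
    latest = (batch_df
              .withColumn("rn", F.row_number().over(w))
              .filter("rn = 1")
              .drop("rn"))

    target = DeltaTable.forName(batch_df.sparkSession, "my_schema.target")
    (target.alias("t")
           .merge(latest.alias("s"), "t.id = s.id")
           # Guard against late/out-of-order batches: only overwrite with newer versions.
           .whenMatchedUpdateAll(condition="s.change_date > t.change_date")
           .whenNotMatchedInsertAll()
           .execute())

(spark.readStream
      .format("cloudFiles")                          # Auto Loader source
      .option("cloudFiles.format", "parquet")
      .load("/mnt/lake/incoming/")
      .writeStream
      .foreachBatch(upsert_latest)
      .trigger(availableNow=True)                    # process all new files, then stop;
      .option("checkpointLocation",                  # schedule the job every 2 hours
              "/mnt/lake/_checkpoints/target")
      .start())
```

With `trigger(availableNow=True)` the stream drains every file Auto Loader has discovered since the last run into micro-batches and then stops, which matches the "process new data every 2 hours" cadence when the notebook is run as a scheduled job.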

