Autoloader: Backfill on millions of files
Databricks Community - Data Engineering
Thread: https://community.databricks.com/t5/data-engineering/autoloader-backfill-on-millions-of-files/m-p/91873#M38282
Autoloader: Backfill on millions of files
Adam_Runarsson · Thu, 26 Sep 2024 11:11:42 GMT
https://community.databricks.com/t5/data-engineering/autoloader-backfill-on-millions-of-files/m-p/91873#M38282

Hi all!

So I've been using Autoloader with File Notification mode against Azure to great success. Once past all the setup, it's rather seamless to use. I did have some issues in the beginning, which is what my question relates to.

The storage account I'm working against has 4 years' worth of data in JSON files, a total of 10-15 TB or so. I noticed that I had to set `includeExistingFiles` to false to achieve better latency, which is as expected.

However, I'm a bit worried about backfill as well, so I wanted to get more information about it. My questions are as follows:

1. Does backfill list all content in the blob storage, then compare it with the processed files in the checkpoint? For millions and millions of files, that is going to take a long time, right?
2. If 1. is true, what should the way forward be? Do you segment storage accounts into months instead?

Hope I'm making sense here 🙂
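For context, the setup being described looks roughly like the sketch below, assuming a Databricks notebook or job where `spark` is in scope. The `cloudFiles.*` options shown are documented Auto Loader options; the source path, checkpoint location, and target table are placeholders, and the Azure credential options that file notification mode needs are omitted.

```python
# Minimal sketch: Auto Loader in file notification mode, skipping the
# pre-existing backlog (includeExistingFiles=false), as described above.
stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")       # file notification mode
    .option("cloudFiles.includeExistingFiles", "false")  # don't process the 4-year backlog
    .load("abfss://container@account.dfs.core.windows.net/events/")  # placeholder path
)

(stream.writeStream
    .option("checkpointLocation", "/chk/events")  # placeholder checkpoint
    .trigger(availableNow=True)
    .toTable("bronze.events"))                    # placeholder target table
```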
Re: Autoloader: Backfill on millions of files
-werners- · Thu, 26 Sep 2024 14:18:26 GMT
https://community.databricks.com/t5/data-engineering/autoloader-backfill-on-millions-of-files/m-p/91892#M38292

If you use backfill, I think it will check all those old files you skipped with the init.
There is the option `maxFileAge`, but the minimum value is 14 days, and Databricks recommends 90 days.
Honestly: I would move all those old files you don't want to process to another directory (or subdirectory), or apply partitioning (which is in fact the same thing).
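The `maxFileAge` mentioned here is the documented `cloudFiles.maxFileAge` option, which bounds how far back Auto Loader keeps file entries in its checkpoint state. A sketch of where it would go, with the same placeholder paths as above and the 90-day value taken from the reply:

```python
# Same placeholder stream, with tracking state bounded by file age.
# The 14-day minimum and 90-day recommendation are per the reply above.
stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.maxFileAge", "90 days")  # expire old entries from checkpoint state
    .load("abfss://container@account.dfs.core.windows.net/events/")  # placeholder path
)
```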
Re: Autoloader: Backfill on millions of files
Adam_Runarsson · Thu, 26 Sep 2024 14:35:03 GMT
https://community.databricks.com/t5/data-engineering/autoloader-backfill-on-millions-of-files/m-p/91894#M38294

Thanks for your reply!

Yeah, I was thinking along those lines: switching the directory structure around to year/month/day/type, pointing my streams at the year/month folders, then updating my source folder once per month. That won't play well with checkpoints, though (as far as I understand it, once you start the stream with a source, the checkpoint will store that source, and updating it won't apply).

Perhaps something like this:

- Start a stream with source 2024/09/* and checkpoint /chko/2024-09.
- When a new month arrives, start another stream on 2024/10/* with checkpoint /chko/2024-10, then allow some grace period before the previous stream is turned off, in case some files are still missing.

It's not optimal, but perhaps the only way forward for this kind of use case. Then for the historical backfill, I guess I would have to have a separate stream.
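A sketch of the two-stream rollover described in the bullets above. The helper function, paths, and table name are hypothetical; the point is that each monthly stream gets its own source glob and its own checkpoint, so their states stay independent and the streams can overlap during the grace period:

```python
# Hypothetical helper: one Auto Loader stream per month, each with its own
# source glob and checkpoint, so a new month can start while the old one drains.
def start_month_stream(source_glob: str, checkpoint: str):
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.useNotifications", "true")
        .load(source_glob)
        .writeStream
        .option("checkpointLocation", checkpoint)
        .toTable("bronze.events")  # placeholder target
    )

base = "abfss://container@account.dfs.core.windows.net/events"  # placeholder
q_sep = start_month_stream(f"{base}/2024/09/*", "/chko/2024-09")
q_oct = start_month_stream(f"{base}/2024/10/*", "/chko/2024-10")
# ...after the grace period, once late September files have drained:
# q_sep.stop()
```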
Re: Autoloader: Backfill on millions of files
-werners- · Fri, 27 Sep 2024 07:11:00 GMT
https://community.databricks.com/t5/data-engineering/autoloader-backfill-on-millions-of-files/m-p/92002#M38315

The docs are pretty sparse on the backfill process, but I think backfill won't just do a scan of the directory; it will instead read the checkpoint file. That seems logical to me, anyway.
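Worth noting alongside this reply: the thread never names it, but the documented knob for backfills in file notification mode is `cloudFiles.backfillInterval`, which has Auto Loader periodically list the source to catch files whose notifications were missed. A sketch, with an illustrative interval and the same placeholder path as above:

```python
# Periodic backfill in notification mode: at each interval, Auto Loader
# lists the source directory to pick up any files it missed via notifications.
stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.backfillInterval", "1 week")  # e.g. "1 day" also works
    .load("abfss://container@account.dfs.core.windows.net/events/")  # placeholder path
)
```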

