Autoloader: Backfill on millions of files
Databricks Community - Data Engineering
Thread: https://community.databricks.com/t5/data-engineering/autoloader-backfill-on-millions-of-files/m-p/91873#M38282
Autoloader: Backfill on millions of files
Adam_Runarsson · Thu, 26 Sep 2024 11:11:42 GMT
https://community.databricks.com/t5/data-engineering/autoloader-backfill-on-millions-of-files/m-p/91873#M38282

Hi all!

So I've been using Autoloader with File Notification mode against Azure to great success. Once past all the setup, it's rather seamless to use. I did have some issues in the beginning, which is what my question relates to.

The storage account I'm working against has 4 years' worth of data in JSON files, a total of 10-15 TB or so. I noticed that I had to set `includeExistingFiles` to false to achieve better latency, which is as expected.

However, I'm a bit worried about backfill as well, so I wanted to get more information about it. My questions are as follows:

1. Does backfill list all content in the blob storage, then compare it with the processed files in the checkpoint? For millions and millions of files, that is going to take a long time, right?
2. If 1. is true, what should the way forward be? Do you segment storage accounts into months instead?

Hope I'm making sense here 🙂
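For context, the setup being described looks roughly like the sketch below, assuming a Databricks notebook or job where `spark` is in scope. The `cloudFiles.*` options shown are documented Auto Loader options; the source path, checkpoint location, and target table are placeholders, and the Azure credential options that file notification mode needs are omitted.

```python
# Minimal sketch: Auto Loader in file notification mode, skipping the
# pre-existing backlog (includeExistingFiles=false), as described above.
stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")       # file notification mode
    .option("cloudFiles.includeExistingFiles", "false")  # don't process the 4-year backlog
    .load("abfss://container@account.dfs.core.windows.net/events/")  # placeholder path
)

(stream.writeStream
    .option("checkpointLocation", "/chk/events")  # placeholder checkpoint
    .trigger(availableNow=True)
    .toTable("bronze.events"))                    # placeholder target table
```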
Re: Autoloader: Backfill on millions of files
-werners- · Thu, 26 Sep 2024 14:18:26 GMT
https://community.databricks.com/t5/data-engineering/autoloader-backfill-on-millions-of-files/m-p/91892#M38292

If you use backfill, I think it will check all those old files you skipped with the init.
There is the option `maxFileAge`, but the minimum value is 14 days, and Databricks recommends 90 days.
Honestly: I would move all those old files you don't want to process to another directory (or subdirectory), or apply partitioning (which is in fact the same thing).
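The `maxFileAge` mentioned here is the documented `cloudFiles.maxFileAge` option, which bounds how far back Auto Loader keeps file entries in its checkpoint state. A sketch of where it would go, with the same placeholder paths as above and the 90-day value taken from the reply:

```python
# Same placeholder stream, with tracking state bounded by file age.
# The 14-day minimum and 90-day recommendation are per the reply above.
stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.maxFileAge", "90 days")  # expire old entries from checkpoint state
    .load("abfss://container@account.dfs.core.windows.net/events/")  # placeholder path
)
```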
Re: Autoloader: Backfill on millions of files
Adam_Runarsson · Thu, 26 Sep 2024 14:35:03 GMT
https://community.databricks.com/t5/data-engineering/autoloader-backfill-on-millions-of-files/m-p/91894#M38294

Thanks for your reply!

Yeah, I was thinking along those lines: switching the directory structure around to year/month/day/type, pointing my streams at the year/month folders, then updating my source folder once per month. That won't play well with checkpoints, though (as far as I understand it, once you start the stream with a source, the checkpoint will store that source, and updating it won't apply).

Perhaps something like this:

- Start a stream with source 2024/09/* and checkpoint /chko/2024-09.
- When a new month arrives, start another stream on 2024/10/* with checkpoint /chko/2024-10, then allow some grace period before the previous stream is turned off, in case some files are still missing.

It's not optimal, but perhaps the only way forward for this kind of use case. Then for the historical backfill, I guess I would have to have a separate stream.
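A sketch of the two-stream rollover described in the bullets above. The helper function, paths, and table name are hypothetical; the point is that each monthly stream gets its own source glob and its own checkpoint, so their states stay independent and the streams can overlap during the grace period:

```python
# Hypothetical helper: one Auto Loader stream per month, each with its own
# source glob and checkpoint, so a new month can start while the old one drains.
def start_month_stream(source_glob: str, checkpoint: str):
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.useNotifications", "true")
        .load(source_glob)
        .writeStream
        .option("checkpointLocation", checkpoint)
        .toTable("bronze.events")  # placeholder target
    )

base = "abfss://container@account.dfs.core.windows.net/events"  # placeholder
q_sep = start_month_stream(f"{base}/2024/09/*", "/chko/2024-09")
q_oct = start_month_stream(f"{base}/2024/10/*", "/chko/2024-10")
# ...after the grace period, once late September files have drained:
# q_sep.stop()
```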
Re: Autoloader: Backfill on millions of files
-werners- · Fri, 27 Sep 2024 07:11:00 GMT
https://community.databricks.com/t5/data-engineering/autoloader-backfill-on-millions-of-files/m-p/92002#M38315

The docs are pretty sparse on the backfill process, but I think backfill won't just do a scan of the directory; it will instead read the checkpoint file. That seems logical to me, anyway.
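Worth noting alongside this reply: the thread never names it, but the documented knob for backfills in file notification mode is `cloudFiles.backfillInterval`, which has Auto Loader periodically list the source to catch files whose notifications were missed. A sketch, with an illustrative interval and the same placeholder path as above:

```python
# Periodic backfill in notification mode: at each interval, Auto Loader
# lists the source directory to pick up any files it missed via notifications.
stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.backfillInterval", "1 week")  # e.g. "1 day" also works
    .load("abfss://container@account.dfs.core.windows.net/events/")  # placeholder path
)
```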

