Autoloader: Backfill on millions of files
09-26-2024 04:11 AM
Hi all!
So I've been using Autoloader with file notification mode against Azure to great success. Once past all the setup, it's rather seamless to use. I did have some issues in the beginning, though, which relate to my question.
The storage account I'm working against has 4 years' worth of data in JSON files, around 10-15 TB in total. I noticed that I had to set `includeExistingFiles` to false to get better latency, which is as expected.
However, I'm a bit worried about backfill as well, so I wanted to get more information about it. My questions are as follows:
1. Does backfill list all the content in the blob storage and then compare it with the processed files in the checkpoint? For millions and millions of files, that's going to take a long time, right?
2. If 1. is true, what should the way forward be? Do you segment storage accounts into months instead?
Hope I'm making sense here 🙂
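For reference, this is roughly how the stream is set up today. It's just a sketch; the paths, container name, and target table are placeholders, not my real config:

```python
# Rough sketch of the Auto Loader stream described above (placeholder paths/names).
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")        # file notification mode
    .option("cloudFiles.includeExistingFiles", "false")   # skip the 4 years of existing files
    .option("cloudFiles.backfillInterval", "1 day")       # the periodic backfill I'm asking about
    .option("cloudFiles.schemaLocation", "/chk/events/schema")
    .load("abfss://container@account.dfs.core.windows.net/events/")
)

(
    df.writeStream
    .option("checkpointLocation", "/chk/events")
    .toTable("bronze.events")
)
```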
Labels: Spark
09-26-2024 07:18 AM
If you use backfill, I think it will check all those old files you skipped in the initial load.
There is the option `maxFileAge`, but the minimum value is 14 days, and Databricks recommends 90 days.
Honestly, I would move all those old files you don't want to process to another directory (or subdirectory), or apply partitioning (which is in fact the same thing).
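Sketched out, roughly. The option names come from the Auto Loader docs, but the paths are made up:

```python
# Option A: point the stream at a subdirectory that only holds the files you still
# want, after moving the historical files elsewhere (or into a partitioned layout).
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    # Option B: cap how long discovered-file entries are kept in the stream's state;
    # this keeps the checkpoint small, it is not a filter on which files get ingested.
    .option("cloudFiles.maxFileAge", "90 days")
    .option("cloudFiles.schemaLocation", "/chk/events/schema")
    .load("abfss://container@account.dfs.core.windows.net/events/recent/")
)
```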
09-26-2024 07:35 AM
Thanks for your reply!
Yeah, I was thinking along those lines: switch the directory structure around to year/month/day/type, point my streams at the year/month folders, and then once per month update the source folder. That won't play well with checkpoints though (as far as I understand it, once you start a stream with a source, the checkpoint stores that source and updating it won't apply).
Perhaps something like this:
- Start a stream with source 2024/09/* with checkpoint /chko/2024-09
- When a new month arrives, start another stream on 2024/10/* with checkpoint /chko/2024-10, then allow some grace period before the previous stream is turned off in case some files are still missing.
It's not optimal, but perhaps the only way forward for this kind of use case. For the historical backfill I would then need a separate stream, I guess. Roughly what I have in mind is sketched below.
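This is only a hypothetical sketch of the month-per-stream rotation; the paths, table name, and helper function are made up:

```python
# Hypothetical sketch: one Auto Loader stream per year/month folder, each with its own checkpoint.
def start_month_stream(year_month: str):
    """Start an Auto Loader stream scoped to a single year/month folder."""
    source = f"abfss://container@account.dfs.core.windows.net/events/{year_month}/"  # e.g. "2024/09"
    checkpoint = f"/chko/{year_month.replace('/', '-')}"                             # e.g. "/chko/2024-09"

    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.useNotifications", "true")
        .option("cloudFiles.schemaLocation", f"{checkpoint}/schema")
        .load(source)
    )

    return (
        df.writeStream
        .option("checkpointLocation", checkpoint)
        .toTable("bronze.events")
    )

# When October starts: spin up the new stream, leave September's running for a grace
# period in case late files arrive, then stop it.
october_query = start_month_stream("2024/10")
```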
09-27-2024 12:11 AM
The docs are pretty sparse on the backfill process, but I think backfill won't just do a full scan of the directory; it will read the checkpoint instead. That seems logical to me, anyway.

