Hi all!
So I've been using Auto Loader in file notification mode against Azure with great success. Once past the initial setup, it's been fairly seamless. I did have some issues in the beginning, though, which relates to my question.
The storage account I'm working against holds four years' worth of data in JSON files, roughly 10-15 TB in total. I noticed I had to set `includeExistingFiles` to false to get better latency, which is expected.
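For context, here is a minimal sketch of the options I mean. The `cloudFiles.*` keys are the real Auto Loader option names; the values and the commented-out path are illustrative, not my actual config:

```python
# Auto Loader options (sketch): file notification mode, skipping the
# existing 4-year backlog, with periodic asynchronous backfill enabled.
autoloader_options = {
    "cloudFiles.format": "json",
    "cloudFiles.useNotifications": "true",        # file notification mode
    "cloudFiles.includeExistingFiles": "false",   # don't ingest the backlog
    "cloudFiles.backfillInterval": "1 day",       # async backfill cadence
}

# In a Databricks notebook this would be used roughly like
# (hypothetical container/account names):
# df = (spark.readStream.format("cloudFiles")
#       .options(**autoloader_options)
#       .load("abfss://container@account.dfs.core.windows.net/events/"))
```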
However, I'm a bit worried about backfill, so I wanted to understand it better. My questions are:
1. Does backfill list all content in the blob storage and then compare it against the processed files in the checkpoint? For millions and millions of files, that's going to take a long time, right?
2. If 1. is true, what should the way forward be? Do you segment storage accounts into months instead?
Hope I'm making sense here 🙂