
Autoloader: Backfill on millions of files

Adam_Runarsson
New Contributor II

Hi all!

So I've been using Autoloader with File Notification mode against Azure to great success. Once past all the setup, it's rather seamless to use. I did have some issues in the beginning, though, which are related to my question.

The storage account I'm working against holds 4 years' worth of data in JSON files, roughly 10-15 TB in total. I noticed that I had to set `includeExistingFiles` to false to achieve better latency, which is as expected.
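For reference, this is roughly what my setup looks like (a minimal sketch - the container, paths, and target table here are placeholders, not my real names):

```python
# Sketch of Auto Loader in file notification mode, skipping existing files.
# Assumes the Databricks-provided `spark` session; paths and table names are hypothetical.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # file notification mode instead of directory listing
    .option("cloudFiles.useNotifications", "true")
    # don't process the 4 years of existing files on the first run
    .option("cloudFiles.includeExistingFiles", "false")
    .load("abfss://events@mystorageaccount.dfs.core.windows.net/json/")
)

(
    df.writeStream
    .option("checkpointLocation", "abfss://events@mystorageaccount.dfs.core.windows.net/_checkpoints/events/")
    .toTable("raw.events")
)
```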

However, I'm also a bit worried about backfill and wanted to understand it better, so my questions are as follows:

  1. Does backfill list all content in the blob storage and then compare it against the processed files in the checkpoint? For millions and millions of files, that is going to take a long time, right?
  2. If 1. is true, what should the way forward be? Do you segment storage accounts into months instead?

Hope I'm making sense here 🙂

3 REPLIES

-werners-
Esteemed Contributor III

If you use backfill, I think it will check all those old files you skipped with the init.
There is the option maxFileAge, but the minimum value is 14 days, and Databricks recommends 90 days.
Honestly: I would move all those old files you don't want to process to another directory (or subdirectory), or apply partitioning (which is in fact the same thing).
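If you do go the maxFileAge route, it would look something like this sketch (values and paths are only illustrative; as far as I understand, maxFileAge controls how long file state is tracked in the checkpoint, it is not a filter on which files get ingested):

```python
# Illustrative only -- maxFileAge limits how long per-file state is kept in the
# checkpoint (for deduplication), rather than selecting which files to read.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.includeExistingFiles", "false")
    # example value; going too low risks reprocessing files as duplicates
    .option("cloudFiles.maxFileAge", "90 days")
    .load("abfss://events@mystorageaccount.dfs.core.windows.net/json/")
)
```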

Adam_Runarsson
New Contributor II

Thanks for your reply!

Yeah, I was thinking along those lines: restructuring the directories into year/month/day/type, pointing my streams at the year/month folders, and then updating my source folder once per month. That won't play well with checkpoints, though - as far as I understand it, once you start a stream with a source, the checkpoint stores that source and updating it won't apply.

Perhaps something like this

  • Start a stream with source 2024/09/* and checkpoint /chko/2024-09
  • When a new month arrives, start another stream on 2024/10/* with checkpoint /chko/2024-10, then allow some grace period before the previous stream is turned off in case some files are still missing.

It's not optimal, but it's perhaps the only way forward for this kind of use case. For the historical backfill, I guess I would then need a separate stream, something like the sketch below.
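Roughly what I have in mind (the helper, path layout, and checkpoint naming are just made up for illustration):

```python
# Sketch of one stream per month with its own checkpoint; helper name,
# paths, and target table are hypothetical.
def start_month_stream(year: int, month: int):
    source = f"abfss://events@mystorageaccount.dfs.core.windows.net/json/{year:04d}/{month:02d}/"
    checkpoint = f"abfss://events@mystorageaccount.dfs.core.windows.net/chko/{year:04d}-{month:02d}/"

    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.useNotifications", "true")
        .load(source)
    )

    return (
        df.writeStream
        .option("checkpointLocation", checkpoint)
        .toTable("raw.events")
    )

# When a new month arrives, start the new stream and keep the old one
# running for a grace period before stopping it.
q_sep = start_month_stream(2024, 9)
q_oct = start_month_stream(2024, 10)
# ... after the grace period:
# q_sep.stop()
```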

-werners-
Esteemed Contributor III

The docs are pretty sparse on the backfill process, but I think backfill won't just scan the directory; it will read the checkpoint file instead. That seems logical to me, anyway.
