
Autoloader: Backfill on millions of files

Adam_Runarsson
New Contributor II

Hi all!

So I've been using Autoloader with File Notification mode against Azure to great success. Once past all the setup, it's rather seamless to use. I did have some issues in the beginning, though, which are related to my question.

The storage account I'm working against holds 4 years' worth of data in JSON files, roughly 10-15 TB in total. I noticed that I had to set `includeExistingFiles` to false to achieve better latency, which is as expected.
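For reference, this is roughly what my setup looks like (a minimal sketch - the container, paths, and target table here are placeholders, not my real names):

```python
# Sketch of Auto Loader in file notification mode, skipping existing files.
# Assumes the Databricks-provided `spark` session; paths and table names are hypothetical.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # file notification mode instead of directory listing
    .option("cloudFiles.useNotifications", "true")
    # don't process the 4 years of existing files on the first run
    .option("cloudFiles.includeExistingFiles", "false")
    .load("abfss://events@mystorageaccount.dfs.core.windows.net/json/")
)

(
    df.writeStream
    .option("checkpointLocation", "abfss://events@mystorageaccount.dfs.core.windows.net/_checkpoints/events/")
    .toTable("raw.events")
)
```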

However, I'm also a bit worried about backfill and wanted to understand it better, so my questions are as follows:

  1. Does backfill list all content in the blob storage and then compare it against the processed files in the checkpoint? For millions and millions of files, that is going to take a long time, right?
  2. If 1. is true, what should the way forward be? Do you segment storage accounts into months instead?

Hope I'm making sense here 🙂

3 REPLIES

-werners-
Esteemed Contributor III

If you use backfill, I think it will check all those old files you skipped with the init.
There is the option maxFileAge, but the minimum value is 14 days, and Databricks recommends 90 days.
Honestly: I would move all those old files you don't want to process to another directory (or subdirectory), or apply partitioning (which is in fact the same thing).
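If you do go the maxFileAge route, it would look something like this sketch (values and paths are only illustrative; as far as I understand, maxFileAge controls how long file state is tracked in the checkpoint, it is not a filter on which files get ingested):

```python
# Illustrative only -- maxFileAge limits how long per-file state is kept in the
# checkpoint (for deduplication), rather than selecting which files to read.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.includeExistingFiles", "false")
    # example value; going too low risks reprocessing files as duplicates
    .option("cloudFiles.maxFileAge", "90 days")
    .load("abfss://events@mystorageaccount.dfs.core.windows.net/json/")
)
```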

Adam_Runarsson
New Contributor II

Thanks for your reply!

Yeah, I was thinking along those lines: restructuring the directories into year/month/day/type, pointing my streams at the year/month folders, and then updating my source folder once per month. That won't play well with checkpoints, though - as far as I understand it, once you start a stream with a source, the checkpoint stores that source and updating it won't apply.

Perhaps something like this

  • Start a stream with source 2024/09/* and checkpoint /chko/2024-09
  • When a new month arrives, start another stream on 2024/10/* with checkpoint /chko/2024-10, then allow some grace period before the previous stream is turned off in case some files are still missing.

It's not optimal, but it's perhaps the only way forward for this kind of use case. For the historical backfill, I guess I would then need a separate stream, something like the sketch below.
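Roughly what I have in mind (the helper, path layout, and checkpoint naming are just made up for illustration):

```python
# Sketch of one stream per month with its own checkpoint; helper name,
# paths, and target table are hypothetical.
def start_month_stream(year: int, month: int):
    source = f"abfss://events@mystorageaccount.dfs.core.windows.net/json/{year:04d}/{month:02d}/"
    checkpoint = f"abfss://events@mystorageaccount.dfs.core.windows.net/chko/{year:04d}-{month:02d}/"

    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.useNotifications", "true")
        .load(source)
    )

    return (
        df.writeStream
        .option("checkpointLocation", checkpoint)
        .toTable("raw.events")
    )

# When a new month arrives, start the new stream and keep the old one
# running for a grace period before stopping it.
q_sep = start_month_stream(2024, 9)
q_oct = start_month_stream(2024, 10)
# ... after the grace period:
# q_sep.stop()
```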

-werners-
Esteemed Contributor III

The docs are pretty sparse on the backfill process, but I think backfill won't just scan the directory; it will read the checkpoint file instead. That seems logical to me, anyway.
