Hey everyone,
I’m trying to clear up some confusion about Auto Loader: how trigger batches relate to micro-batches when using .foreachBatch.
Here’s what I understand so far:
Trigger batch – Controlled by cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigger. This determines how many new files Auto Loader reads per streaming trigger.
Micro-batch in .foreachBatch – This is the batch of data (a DataFrame plus a batch ID) that your callback function receives.
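
To make the setup concrete, here is a minimal sketch of the kind of pipeline I mean. The paths, the JSON format, the option values, and the names `handle_batch` and `start_stream` are placeholders for illustration, and `spark` is assumed to be an existing SparkSession on Databricks. It only shows where the trigger options and the callback sit; it doesn't settle any of the questions below.

```python
# Placeholder Auto Loader options; the values are illustrative only.
AUTOLOADER_OPTIONS = {
    "cloudFiles.format": "json",
    "cloudFiles.maxFilesPerTrigger": "10",
    "cloudFiles.schemaLocation": "/tmp/example/schema",  # hypothetical path
}


def handle_batch(batch_df, batch_id):
    # Callback passed to foreachBatch: receives one micro-batch as a
    # DataFrame together with its batch ID.
    print(f"batch {batch_id}: {batch_df.count()} rows")


def start_stream(spark, source_path, checkpoint_path):
    # Build the Auto Loader reader and apply the options above.
    reader = spark.readStream.format("cloudFiles")
    for key, value in AUTOLOADER_OPTIONS.items():
        reader = reader.option(key, value)

    # Hand every micro-batch to the callback; availableNow is just one
    # of the trigger types I'm asking about.
    return (
        reader.load(source_path)
        .writeStream
        .foreachBatch(handle_batch)
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)
        .start()
    )
```

On Databricks this would be started with something like `start_stream(spark, "/mnt/raw/events", "/mnt/checkpoints/events")`, where both paths are made up.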
My questions are:
1. Are trigger batches and .foreachBatch micro-batches exactly the same thing?
2. If they are not the same, do they map one-to-one? For example, if I set maxFilesPerTrigger=10, does each .foreachBatch call always receive exactly 10 files (if available), or could it receive more or fewer depending on internal Spark scheduling?
3. Can I set the .foreachBatch micro-batch size, just as I set the trigger size, or is it controlled by internal Spark configuration?
4. Does the trigger type (availableNow, time-scheduled trigger, real-time streaming) affect any of the answers above?
5. Any suggestions to keep in mind for the initial (historical) load?