For a production workload of around 15k gzip-compressed JSON files per hour, all landing in a YYYY/MM/DD/HH/id/timestamp.json.gz directory structure:
What would be the better approach for ingesting this into a Delta table, covering not only incremental loads but also reprocessing?
So far I've tried both the directory listing and event notification methods through Auto Loader. Event notifications do seem quicker on incremental loads, though I'm wary of the lack of a 100% delivery SLA (more on that below). Both, however, are painfully slow when reprocessing workloads of this size.
With event notifications ingesting 15k files per hour (so daily runs accumulate around 360k files), could some files be missed by the notifications? I've seen an option for backfilling the data at an interval on these notifications, but that falls back to listing the entire directory, so I'm not sure whether re-architecting how the files land would help Auto Loader at all.
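For reference, here is a rough sketch of the notification-mode stream I'm running. The paths, checkpoint locations, and table name are placeholders; `cloudFiles.useNotifications` and `cloudFiles.backfillInterval` are the Auto Loader options I'm referring to above.

```python
# Sketch only: paths and table names are placeholders for my actual setup.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    # event notification mode instead of directory listing
    .option("cloudFiles.useNotifications", "true")
    # periodic backfill to catch files whose notifications were missed;
    # this is the option that falls back to listing the directory
    .option("cloudFiles.backfillInterval", "1 day")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/raw_events/schema")
    # root above the YYYY/MM/DD/HH/id/ partitioned drop location
    .load("/mnt/landing/")
)

(
    df.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/raw_events")
    .trigger(availableNow=True)  # batch-style daily run
    .toTable("raw.events")
)
```

For reprocessing, I've been dropping the checkpoint and letting the stream re-list everything, which is where both modes become painfully slow.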