I am trying to set up S3 as a Structured Streaming source. The bucket receives ~17K files/day and the original load to the bucket was ~54K files. The bucket was first loaded 3 months ago and we haven't started reading from it yet, so back of the napkin there are roughly 1.6 million files in it (54K + (17K x ~90 days)). We are trying to reduce the number of files by increasing their size, but we don't have much control over that beyond providing file size hints, and the actual files are nowhere near the size we've hinted.
We initially created the job as a streaming job, in the hope that Spark would have no issues processing the files incrementally (17K / 24 ≈ 700/hour). However, we can't currently get through the backlog because Spark keeps dying with OOM errors.
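For context, this is roughly the shape of the job. The bucket paths, file format (JSON), and schema below are simplified placeholders, not our real ones:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("s3-file-stream").getOrCreate()

# Streaming file sources need an explicit schema up front.
schema = StructType([
    StructField("id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("payload", StringType()),
])

raw = (
    spark.readStream
    .schema(schema)
    .json("s3a://our-bucket/landing/")   # placeholder path and format
)

query = (
    raw.writeStream
    .format("parquet")
    .option("path", "s3a://our-bucket/output/")                       # placeholder
    .option("checkpointLocation", "s3a://our-bucket/_checkpoints/ingest/")
    .start()
)
query.awaitTermination()
```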
First, I'm trying to confirm that our problem really is the number of files on the initial read. Does the file source in Structured Streaming behave the same way as a batch read, i.e., is it failing because it tries to list all ~1.6MM files and load their metadata up front, rather than iterating over the files from S3 in order? Also, once we get the initial load done, will streaming with a checkpoint avoid the problem, or will Spark still try to list and read the metadata of every file in the bucket on every run?
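To make the second question concrete, this is what I mean by "streaming with a checkpoint", plus the documented file-source options (`maxFilesPerTrigger`, `latestFirst`) that, as I read the docs, are supposed to make the backlog drain incrementally. What I can't tell is whether the source still has to list every object in the bucket before applying them. Values and paths are again placeholders:

```python
# Same job as the sketch above, but capping how many new files each
# micro-batch picks up; the checkpoint directory records progress between runs.
# The option values here are guesses, not tuned numbers.
throttled = (
    spark.readStream
    .schema(schema)                        # schema defined in the snippet above
    .option("maxFilesPerTrigger", 1000)    # at most 1,000 new files per micro-batch
    .option("latestFirst", "false")        # drain the backlog oldest-first
    .json("s3a://our-bucket/landing/")     # same placeholder path as above
)

query = (
    throttled.writeStream
    .format("parquet")
    .option("path", "s3a://our-bucket/output/")
    .option("checkpointLocation", "s3a://our-bucket/_checkpoints/ingest/")
    .start()
)
```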