cancel
Showing results for 
Search instead for 
Did you mean: 
Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.
cancel
Showing results for 
Search instead for 
Did you mean: 

Structured Streaming of S3 source

dvmentalmadess
Valued Contributor

I am trying to setup s3 as a structured streaming source. The bucket receives ~17K files/day and the original load to the bucket was ~54K files. The bucket was first loaded 3 months ago and we haven't started reading from it since. So let's say there's 1,530,054 files based on back of the napkin (54 + (17 x 30)). We are trying to find a way to reduce the number of files by increasing their size, but it isn't something we have much more control than providing file size hints and the actual files are nowhere near the hint we've specified.

We initially created the job as a streaming job in the hopes Spark would have no issues processing the files incrementally (17k / 24 ~= 700/hour). However, we can't currently process the backlog because Spark is dying w/ OOM errors.

First, I'm trying to confirm that our problem is actually due to the number of files on the initial read. Does Spark Streaming work the same way as batch does, i.e., it is failing when trying to load 1.5MM files rather than just trying to iterate over the files from S3 in order? Also, once we get the initial load done will using streaming w/ a checkpoint avoid the problem or will Spark still try to list and read all the file metadata on every run?

1 REPLY 1

dvmentalmadess
Valued Contributor

Thanks,

We were able to make things work by increasing the driver instance size so it has more memory for the initial load. After initial load we scaled the instance down for subsequent runs. We're still testing, if we aren't able to make it work we'll try the suggestions.

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!

Sign Up Now