Databricks Community

dvmentalmadess · ‎09-18-2023

I am trying to setup s3 as a structured streaming source. The bucket receives ~17K files/day and the original load to the bucket was ~54K files. The bucket was first loaded 3 months ago and we haven't started reading from it since. So let's say there's 1,530,054 files based on back of the napkin (54 + (17 x 30)). We are trying to find a way to reduce the number of files by increasing their size, but it isn't something we have much more control than providing file size hints and the actual files are nowhere near the hint we've specified.

We initially created the job as a streaming job in the hopes Spark would have no issues processing the files incrementally (17k / 24 ~= 700/hour). However, we can't currently process the backlog because Spark is dying w/ OOM errors.

First, I'm trying to confirm that our problem is actually due to the number of files on the initial read. Does Spark Streaming work the same way as batch does, i.e., it is failing when trying to load 1.5MM files rather than just trying to iterate over the files from S3 in order? Also, once we get the initial load done will using streaming w/ a checkpoint avoid the problem or will Spark still try to list and read all the file metadata on every run?

dvmentalmadess · ‎09-20-2023

Thanks,

We were able to make things work by increasing the driver instance size so it has more memory for the initial load. After initial load we scaled the instance down for subsequent runs. We're still testing, if we aren't able to make it work we'll try the suggestions.

Databricks Community

Structured Streaming of S3 source

Connect with Databricks Users in Your Area

South Florida Databricks User Group: Accelerate Projects to Value with GenAI

Struggling with BI? We want to hear from you!

Introducing Databricks Apps

Databricks Community Champion - September 2024 - Szymon Dybczak

Intelligent Data Engineering: Beyond the AI Hype