Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Why is Spark Streaming from S3 returning thousands of files when there are only 9?

CarterM
New Contributor III

I am attempting to stream JSON endpoint responses from an S3 bucket into a Spark DLT table. I have done this successfully before, but this time I am storing the responses from multiple endpoints in the same S3 bucket, whereas previously each bucket held the response from a single endpoint.

Setting The Scene

I am making GET requests to 9 different Season Endpoints, each for a different Sports League. The endpoints return a list of Seasons for the given Sports League. I decided to store them all in the same s3 bucket because the endpoint responses are structured almost identically.

8/9 of the endpoints return a response structured like so:

[Image: response structure for 8 of the 9 endpoints, giving a League and a list of Seasons]

The 9th endpoint is for Soccer Seasons, which is structured slightly differently but it also returns a list of Seasons.

[Image: Soccer endpoint (9th) response structure]

Key endpoint differences:

  • 8/9 endpoints provide a League object, Season.Type, and Season.Status
  • The 9th (Soccer) endpoint does NOT have a League object, Season.Type, or Season.Status
    • It additionally provides a Season.Competitor_Id
  • It is also important to note that the other 8 endpoint responses are not exactly the same as each other; there is some variation in the Season fields they provide.
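
To make the differences listed above concrete, here is a minimal sketch that compares the Season keys of two response shapes. The payloads are hypothetical: the key names are modeled on the fields named in this post (League, Seasons, Type, Status, Competitor_Id), while `Id` and all of the values are invented for illustration because the real responses are only shown as screenshots.

```python
# Hypothetical payloads; the real endpoint responses are not reproduced in the thread.
league_response = {
    "League": {"Name": "Example League"},
    "Seasons": [{"Id": 1, "Type": "REG", "Status": "closed"}],
}
soccer_response = {
    "Seasons": [{"Id": 2, "Competitor_Id": 99}],
}


def season_keys(response):
    """Collect every key used by any Season object in a response."""
    keys = set()
    for season in response.get("Seasons", []):
        keys.update(season)
    return keys


# Keys that appear in one shape but not in the other.
only_in_leagues = season_keys(league_response) - season_keys(soccer_response)
only_in_soccer = season_keys(soccer_response) - season_keys(league_response)

print("Only in the 8 league responses:", sorted(only_in_leagues))
print("Only in the soccer response:", sorted(only_in_soccer))
```

A check like this, run against the real files, would show which Season fields need to be nullable if all 9 responses share one table.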

The Issue:

When streaming from the S3 bucket, Spark reports thousands of JSON files when there are only 9.

Here you can see all 9 in the bucket. Notice Soccer is the largest file as well.

[Image: the 9 endpoint responses in the same S3 bucket]

However, when I display the streaming Spark DataFrame for the S3 bucket, it shows thousands of null soccer JSON files, while correctly showing only 1 JSON file for each of the other 8 sports.

I suspect that the soccer endpoint response file is being auto-exploded into thousands of files. This would explain why the Seasons column for soccer is null because it has already exploded and now contains other fields.

Any guidance would be appreciated and if I can be any more clear please let me know.

Thanks in advance!

1 ACCEPTED SOLUTION

CarterM
New Contributor III

The Soccer Seasons endpoint response was the only one containing backslashes (`\`), which signify a newline character. When using Auto Loader, you need to specify the `multiLine` option to indicate the JSON spans multiple....

.option("multiline", "true")

Without this option, the S3 stream interpreted each section between backslashes as a separate JSON record instead of reading the file as a single document, which resulted in thousands of null Soccer rows and no Seasons column.
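
The effect described here can be reproduced without Spark. The sketch below is only an analogy using Python's standard `json` module, not Auto Loader itself: it parses a hypothetical pretty-printed payload once line by line (the way a line-delimited JSON reader would) and once as a whole document. The payload and the helper names `parse_line_delimited` and `parse_multiline` are made up for illustration.

```python
import json

# Hypothetical pretty-printed response; the real soccer payload is not shown in the thread.
pretty = json.dumps(
    {"Seasons": [{"Id": 1, "Name": "2021/22"}, {"Id": 2, "Name": "2022/23"}]},
    indent=2,
)


def parse_line_delimited(text):
    """Parse each line as its own JSON record, like a JSON Lines reader.

    A line that is not valid JSON on its own becomes None, similar to the
    null rows described above.
    """
    records = []
    for line in text.splitlines():
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            records.append(None)
    return records


def parse_multiline(text):
    """Parse the whole text as a single JSON document."""
    return [json.loads(text)]


line_records = parse_line_delimited(pretty)
whole_records = parse_multiline(pretty)

print(len(line_records), "records when each line is parsed separately")
print(len(whole_records), "record when the file is parsed as one document")
```

In the line-by-line case, every line of the pretty-printed file turns into its own unusable record, which matches the symptom of one file appearing as many null rows.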

[Image: JSON with backslashes causing the error]
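
For context, here is a minimal sketch of where that option might sit in an Auto Loader read. It assumes an existing `SparkSession` on Databricks passed in as `spark`. The function name `read_seasons_stream` and the bucket path in the usage note are hypothetical, and other options (for example schema settings) may be needed depending on where the stream runs.

```python
# Assumed option set: "cloudFiles.format" tells Auto Loader to parse JSON, and
# "multiLine" is the option from the accepted solution.
AUTOLOADER_JSON_OPTIONS = {
    "cloudFiles.format": "json",
    "multiLine": "true",
}


def read_seasons_stream(spark, source_path):
    """Return a streaming DataFrame that reads multi-line JSON files.

    `spark` is an existing SparkSession; `source_path` is the (hypothetical)
    location of the endpoint responses.
    """
    return (
        spark.readStream.format("cloudFiles")
        .options(**AUTOLOADER_JSON_OPTIONS)
        .load(source_path)
    )
```

It would be called as something like `read_seasons_stream(spark, "s3://example-bucket/seasons/")`, with the bucket name standing in for the real one.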

4 REPLIES

Anonymous
Not applicable

@Carter Mooring​ Thank you SO MUCH for coming back to provide a solution to your thread! Happy you were able to figure this out so quickly. And I am sure that this will help someone in the future with the same issue. 😊

williamyoung
New Contributor II

Hello Everyone,

It seems like the issue you're encountering could be related to how Spark Streaming interprets the S3 file structure, especially when dealing with multiple sources. When files from multiple endpoints are stored in the same bucket, Spark might treat each prefix or partition as a separate file set, leading to the appearance of thousands of files instead of the expected nine. To resolve this, you may want to consider restructuring your S3 bucket to have clearer, distinct directories for each endpoint. Lastly, to ensure you have the best experience and up-to-date information, I highly recommend visiting or installing the latest version of Sportzfy TV for any necessary updates. Additionally, ensure that your Spark configuration is optimized for reading from S3, such as setting appropriate options for file listing and object store consistency....Grateful!

Best of Luck!!

Hi @williamyoung ,

I must admit, that's a creative way to sneak in spam 😄 Anyway, I marked it as inappropriate content.
