<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Why Spark Streaming from S3 is returning thousands of files when there are only 9? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/why-spark-streaming-from-s3-is-returning-thousands-of-files-when/m-p/30147#M21825</link>
    <description>&lt;P&gt;I am attempting to stream JSON endpoint responses from an S3 bucket into a Spark Delta Live Table (DLT). This has worked well for me previously, but the difference this time is that I am storing the responses from multiple endpoints in the same S3 bucket, whereas before it was a single endpoint response in a single bucket.&lt;/P&gt;&lt;P&gt;&lt;B&gt;Setting the Scene&lt;/B&gt;&lt;/P&gt;&lt;P&gt;I am making GET requests to 9 different Season endpoints, one for each Sports League. Each endpoint returns a list of Seasons for the given league. I decided to store them all in the same S3 bucket because the endpoint responses are structured almost identically.&lt;/P&gt;&lt;P&gt;8 of the 9 endpoints return a response structured like so, giving a League and a list of Seasons:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="8_9 endpoint response structure"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1433iD190260A6BA34F43/image-size/large?v=v2&amp;amp;px=999" role="button" title="8_9 endpoint response structure" alt="8_9 endpoint response structure" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;The 9th endpoint is for Soccer Seasons, which is structured slightly differently but also returns a list of Seasons.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Soccer  endpoint  9"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1425iE7AC1B55B4014F0E/image-size/large?v=v2&amp;amp;px=999" role="button" title="Soccer  endpoint  9" alt="Soccer  endpoint  9" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Key endpoint differences:&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;8 of the 9 endpoints give a League object, Season.Type, and Season.Status&lt;/LI&gt;&lt;LI&gt;The 9th (Soccer) endpoint does NOT have a League object, Season.Type, or Season.Status&lt;UL&gt;&lt;LI&gt;It additionally provides a Season.Competitor_Id&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;Note also that the 8 similar endpoint responses are not exactly identical; there is some variation in the Season fields they provide.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;The Issue:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;When Spark streams from the S3 bucket, it reports thousands of JSON files when there are only 9.&lt;/P&gt;&lt;P&gt;Here you can see all 9 in the bucket. Notice that Soccer is also the largest file.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="9 endpoint responses in same s3 bucket"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1441i0E8DC28FD1A0D897/image-size/large?v=v2&amp;amp;px=999" role="button" title="9 endpoint responses in same s3 bucket" alt="9 endpoint responses in same s3 bucket" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;However, when displaying the streaming Spark DataFrame of the S3 bucket, it shows thousands of null Soccer JSON files while correctly displaying the other 8 sports as having only 1 JSON file each.&lt;/P&gt;&lt;P&gt;I suspect that the Soccer endpoint response file is being auto-exploded into thousands of files. This would also explain why the Seasons column for Soccer is null: the document has already been exploded and now contains other fields.&lt;/P&gt;&lt;P&gt;Any guidance would be appreciated, and if I can be any clearer please let me know.&lt;/P&gt;&lt;P&gt;Thanks in advance!&lt;/P&gt;</description>
    <pubDate>Thu, 29 Sep 2022 23:46:56 GMT</pubDate>
    <dc:creator>CarterM</dc:creator>
    <dc:date>2022-09-29T23:46:56Z</dc:date>
    <item>
      <title>Why Spark Streaming from S3 is returning thousands of files when there are only 9?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-spark-streaming-from-s3-is-returning-thousands-of-files-when/m-p/30147#M21825</link>
      <description>&lt;P&gt;I am attempting to stream JSON endpoint responses from an S3 bucket into a Spark Delta Live Table (DLT). This has worked well for me previously, but the difference this time is that I am storing the responses from multiple endpoints in the same S3 bucket, whereas before it was a single endpoint response in a single bucket.&lt;/P&gt;&lt;P&gt;&lt;B&gt;Setting the Scene&lt;/B&gt;&lt;/P&gt;&lt;P&gt;I am making GET requests to 9 different Season endpoints, one for each Sports League. Each endpoint returns a list of Seasons for the given league. I decided to store them all in the same S3 bucket because the endpoint responses are structured almost identically.&lt;/P&gt;&lt;P&gt;8 of the 9 endpoints return a response structured like so, giving a League and a list of Seasons:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="8_9 endpoint response structure"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1433iD190260A6BA34F43/image-size/large?v=v2&amp;amp;px=999" role="button" title="8_9 endpoint response structure" alt="8_9 endpoint response structure" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;The 9th endpoint is for Soccer Seasons, which is structured slightly differently but also returns a list of Seasons.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Soccer  endpoint  9"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1425iE7AC1B55B4014F0E/image-size/large?v=v2&amp;amp;px=999" role="button" title="Soccer  endpoint  9" alt="Soccer  endpoint  9" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Key endpoint differences:&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;8 of the 9 endpoints give a League object, Season.Type, and Season.Status&lt;/LI&gt;&lt;LI&gt;The 9th (Soccer) endpoint does NOT have a League object, Season.Type, or Season.Status&lt;UL&gt;&lt;LI&gt;It additionally provides a Season.Competitor_Id&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;Note also that the 8 similar endpoint responses are not exactly identical; there is some variation in the Season fields they provide.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;B&gt;The Issue:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;When Spark streams from the S3 bucket, it reports thousands of JSON files when there are only 9.&lt;/P&gt;&lt;P&gt;Here you can see all 9 in the bucket. Notice that Soccer is also the largest file.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="9 endpoint responses in same s3 bucket"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1441i0E8DC28FD1A0D897/image-size/large?v=v2&amp;amp;px=999" role="button" title="9 endpoint responses in same s3 bucket" alt="9 endpoint responses in same s3 bucket" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;However, when displaying the streaming Spark DataFrame of the S3 bucket, it shows thousands of null Soccer JSON files while correctly displaying the other 8 sports as having only 1 JSON file each.&lt;/P&gt;&lt;P&gt;I suspect that the Soccer endpoint response file is being auto-exploded into thousands of files. This would also explain why the Seasons column for Soccer is null: the document has already been exploded and now contains other fields.&lt;/P&gt;&lt;P&gt;Any guidance would be appreciated, and if I can be any clearer please let me know.&lt;/P&gt;&lt;P&gt;Thanks in advance!&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2022 23:46:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-spark-streaming-from-s3-is-returning-thousands-of-files-when/m-p/30147#M21825</guid>
      <dc:creator>CarterM</dc:creator>
      <dc:date>2022-09-29T23:46:56Z</dc:date>
    </item>
    <item>
      <title>Re: Why Spark Streaming from S3 is returning thousands of files when there are only 9?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-spark-streaming-from-s3-is-returning-thousands-of-files-when/m-p/30148#M21826</link>
      <description>&lt;P&gt;The Soccer Seasons endpoint response was the only one containing newline characters (the backslash `\n` escapes visible below), meaning that JSON document spans multiple lines. &lt;A href="https://docs.databricks.com/ingestion/auto-loader/options.html?_ga=2.221493854.717169095.1664489763-2072853648.1650479458" alt="https://docs.databricks.com/ingestion/auto-loader/options.html?_ga=2.221493854.717169095.1664489763-2072853648.1650479458" target="_blank"&gt;When using Auto Loader you need to specify the `multiLine` option to indicate that the JSON spans multiple lines&lt;/A&gt;.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;.option("multiLine", "true")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Without that option, the S3 stream interpreted each section between newlines as a separate JSON document, resulting in thousands of null Soccer records and no Seasons column.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="JSON with backslashes causing the error"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1430i4E656F976703F039/image-size/large?v=v2&amp;amp;px=999" role="button" title="JSON with backslashes causing the error" alt="JSON with backslashes causing the error" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
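      The failure mode can be sketched outside Spark. A minimal pure-Python illustration (the payload below is hypothetical, not the actual Soccer response) of why a multiline JSON document read line by line, as a default JSON-lines reader does, yields mostly null records, while whole-document parsing (what `multiLine` enables) recovers the Seasons list:

      ```python
      import json

      # A pretty-printed (multiline) JSON response, like the Soccer endpoint file.
      payload = '{\n  "Seasons": [\n    {"Id": 1},\n    {"Id": 2}\n  ]\n}'

      # Line-by-line parsing (the JSON-lines assumption): almost every
      # line fails to parse on its own, surfacing as null/corrupt records.
      records = []
      for line in payload.splitlines():
          try:
              records.append(json.loads(line))
          except json.JSONDecodeError:
              records.append(None)

      # Whole-document parsing: one record with the full Seasons list.
      doc = json.loads(payload)
      ```

      With 6 lines in the payload, 5 of the 6 per-line parses fail, mirroring how one large multiline file can balloon into thousands of null rows.
      
      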
      <pubDate>Sat, 01 Oct 2022 00:23:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-spark-streaming-from-s3-is-returning-thousands-of-files-when/m-p/30148#M21826</guid>
      <dc:creator>CarterM</dc:creator>
      <dc:date>2022-10-01T00:23:05Z</dc:date>
    </item>
    <item>
      <title>Re: Why Spark Streaming from S3 is returning thousands of files when there are only 9?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-spark-streaming-from-s3-is-returning-thousands-of-files-when/m-p/30149#M21827</link>
      <description>&lt;P&gt;@Carter Mooring&amp;nbsp;Thank you SO MUCH for coming back to provide a solution to your thread! Happy you were able to figure this out so quickly. And I am sure that this will help someone in the future with the same issue. &lt;span class="lia-unicode-emoji" title=":smiling_face_with_smiling_eyes:"&gt;😊&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 03 Oct 2022 22:04:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-spark-streaming-from-s3-is-returning-thousands-of-files-when/m-p/30149#M21827</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-10-03T22:04:39Z</dc:date>
    </item>
    <item>
      <title>Re: Why Spark Streaming from S3 is returning thousands of files when there are only 9?</title>
      <link>https://community.databricks.com/t5/data-engineering/why-spark-streaming-from-s3-is-returning-thousands-of-files-when/m-p/84150#M37139</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/117305"&gt;@williamyoung&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;I must admit, that's a creative way to sneak in spam &lt;span class="lia-unicode-emoji" title=":grinning_face_with_smiling_eyes:"&gt;😄&lt;/span&gt; Anyway, I marked it as inappropriate content.&lt;/P&gt;</description>
      <pubDate>Sat, 24 Aug 2024 11:54:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/why-spark-streaming-from-s3-is-returning-thousands-of-files-when/m-p/84150#M37139</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-08-24T11:54:38Z</dc:date>
    </item>
  </channel>
</rss>

