<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Spark Streaming Error Listing in GCS in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/spark-streaming-error-listing-in-gcs/m-p/122695#M46842</link>
    <pubDate>Tue, 24 Jun 2025 14:19:31 GMT</pubDate>
    <dc:creator>Walter_C</dc:creator>
    <dc:date>2025-06-24T14:19:31Z</dc:date>
    <item>
      <title>Spark Streaming Error Listing in GCS</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-error-listing-in-gcs/m-p/122315#M46740</link>
      <description>&lt;P&gt;I am hitting an error while listing &lt;STRONG&gt;_delta_log&lt;/STRONG&gt; when a Spark streaming query reads a Delta table stored in GCS.&lt;/P&gt;&lt;P&gt;This is the full log of the issue:&lt;/P&gt;&lt;P&gt;org.apache.spark.sql.streaming.StreamingQueryException: Failed to get result: java.io.IOException: Error listing gs://&amp;lt;bucket_name&amp;gt;/bronze-layer/&amp;lt;database_name&amp;gt;/&amp;lt;table_name&amp;gt;/_delta_log/. reason=Connection closed prematurely: bytesRead = 199316, Content-Length = 1010876 with message : java.io.IOException: Error listing gs://&amp;lt;bucket_name&amp;gt;/bronze-layer/&amp;lt;database_name&amp;gt;/&amp;lt;table_name&amp;gt;/_delta_log/. reason=Connection closed prematurely: bytesRead = 199316, Content-Length = 1010876&lt;/P&gt;&lt;P&gt;Has anyone encountered this issue? It happens only occasionally. I have confirmed that the Databricks environment has the necessary permissions to access the GCS bucket, and the log retention interval is set to 7 days. The table is currently very large and has many log files in _delta_log.&lt;/P&gt;&lt;P&gt;How can I resolve this issue and determine the root cause?&lt;/P&gt;</description>
      <pubDate>Fri, 20 Jun 2025 06:55:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-error-listing-in-gcs/m-p/122315#M46740</guid>
      <dc:creator>loinguyen3182</dc:creator>
      <dc:date>2025-06-20T06:55:53Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Streaming Error Listing in GCS</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-error-listing-in-gcs/m-p/122695#M46842</link>
      <description>&lt;P class="_1t7bu9h1 paragraph"&gt;The key contributing factors to this issue, according to internal investigations and customer tickets, include:&lt;/P&gt;
&lt;UL class="_1t7bu9h6 _1t7bu9h2"&gt;
&lt;LI class="_1t7bu9h9"&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Large Number of Log Files in _delta_log&lt;/STRONG&gt;: Delta Lake maintains a JSON transaction log that grows with every commit. The more files present, the larger each listing operation and the greater the risk of exceeding network or cloud storage client limitations during read operations. This risk is magnified in streaming scenarios, especially at scale.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Connection Closure During Metadata Listing&lt;/STRONG&gt;: The underlying error (&lt;CODE&gt;Connection closed prematurely ... bytesRead != Content-Length&lt;/CODE&gt;) points to a prematurely ended HTTP connection while downloading a large &lt;CODE&gt;_delta_log&lt;/CODE&gt; listing result. This can be due to:&lt;/P&gt;
&lt;UL class="_1t7bu9h7 _1t7bu9h2"&gt;
&lt;LI class="_1t7bu9h9"&gt;Network instability.&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;GCS forcibly closing idle or long-lived requests.&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;The Hadoop GCS connector or client SDK not retrying this particular error class automatically.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Client or Connector Limitations&lt;/STRONG&gt;: The Databricks (and open source Hadoop) GCS connector does implement retries for many IO errors, but according to internal ticket discussions, certain cases—such as connection resets during listing—are not handled robustly, and jobs may fail unless manual retries are built in at a higher level.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 class="_1jeaq5e0 _1t7bu9hb y728l9aj heading3 _1jeaq5e1"&gt;Known Workarounds and Solutions&lt;/H3&gt;
&lt;UL class="_1t7bu9h6 _1t7bu9h2"&gt;
&lt;LI class="_1t7bu9h9"&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Implement Retry Logic at the Job Level&lt;/STRONG&gt;: Since not all GCS IO failures are retried automatically, consider adding retry or workflow restart mechanisms around your streaming job. Some customers have mitigated transient failures by catching the failure and restarting the job, or by adding orchestration-level retries.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Reduce _delta_log File Count&lt;/STRONG&gt;:&lt;/P&gt;
&lt;UL class="_1t7bu9h7 _1t7bu9h2"&gt;
&lt;LI class="_1t7bu9h9"&gt;Run Delta Lake OPTIMIZE and VACUUM periodically to compact data files and remove obsolete ones. Note that VACUUM only cleans up data files; it does not remove the log files in &lt;CODE&gt;_delta_log&lt;/CODE&gt;.&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;Lower the checkpoint interval (&lt;CODE&gt;delta.checkpointInterval&lt;/CODE&gt;) so checkpoints are created more frequently, allowing Spark to skip many JSON logs and list/parse fewer files during state reconstruction.&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;Reduce retention intervals for logs and checkpoints (&lt;CODE&gt;delta.logRetentionDuration&lt;/CODE&gt;, &lt;CODE&gt;delta.checkpointRetentionDuration&lt;/CODE&gt;) to allow more aggressive cleanup, but with caution: make sure no long-running query or time travel needs history older than these intervals.&lt;/LI&gt;
&lt;/UL&gt;
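&lt;P class="_1t7bu9h1 paragraph"&gt;As a minimal sketch of how two of the table properties named above could be set, the helper below builds an &lt;CODE&gt;ALTER TABLE ... SET TBLPROPERTIES&lt;/CODE&gt; statement. The function name, the table name &lt;CODE&gt;main.bronze.events&lt;/CODE&gt;, and the values (a checkpoint every 5 commits, 7 days of log retention) are illustrative assumptions, not recommendations from this thread.&lt;/P&gt;
&lt;PRE&gt;
```python
def delta_log_tuning_sql(table_name, checkpoint_interval=5, log_retention="interval 7 days"):
    """Build an ALTER TABLE statement for two log-related Delta table properties.

    The default values are placeholders chosen for illustration only.
    """
    props = {
        "delta.checkpointInterval": str(checkpoint_interval),
        "delta.logRetentionDuration": log_retention,
    }
    # Render each property as 'key' = 'value', in insertion order.
    assignments = ", ".join(f"'{key}' = '{value}'" for key, value in props.items())
    return f"ALTER TABLE {table_name} SET TBLPROPERTIES ({assignments})"


# Hypothetical table name; on a cluster the string would be passed to spark.sql(...).
print(delta_log_tuning_sql("main.bronze.events"))
```
&lt;/PRE&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;Building the statement as a plain string keeps the sketch runnable without a Spark session. Before applying such a change, check the chosen values against your own time travel and long-running query requirements.&lt;/P&gt;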
&lt;/LI&gt;
&lt;LI class="_1t7bu9h9"&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Monitor for GCS-Side Issues&lt;/STRONG&gt;: Coordinate with your cloud team to check GCS logs for bandwidth throttling, idle timeout, or TCP connection limitation indicators.&lt;/P&gt;
&lt;/LI&gt;
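&lt;LI class="_1t7bu9h9"&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;&lt;STRONG&gt;Sketch of a Job-Level Retry&lt;/STRONG&gt;: The job-level retry suggested above might look roughly like the following. This is a minimal sketch, not a Databricks API: &lt;CODE&gt;run_with_retries&lt;/CODE&gt;, &lt;CODE&gt;is_transient_listing_error&lt;/CODE&gt;, and the &lt;CODE&gt;start_query&lt;/CODE&gt; callable are hypothetical names, and matching on the error message text is only a heuristic based on the log quoted in this thread.&lt;/P&gt;
&lt;PRE&gt;
```python
import time


def is_transient_listing_error(exc):
    """Heuristic: look for the message fragment seen in this thread's error log."""
    return "Connection closed prematurely" in str(exc)


def run_with_retries(start_query, max_attempts=5, base_delay_s=10.0, sleep=time.sleep):
    """Call start_query() until it returns, retrying transient listing errors.

    start_query is a hypothetical callable that starts the stream and blocks
    until the stream stops or fails.
    """
    attempt = 1
    while True:
        try:
            return start_query()
        except Exception as exc:
            # Re-raise unrelated errors, or give up once the attempt budget is spent.
            if not is_transient_listing_error(exc) or attempt >= max_attempts:
                raise
            # Exponential backoff: base delay, then 2x, 4x, ...
            sleep(base_delay_s * 2 ** (attempt - 1))
            attempt += 1
```
&lt;/PRE&gt;
&lt;P class="_1t7bu9h1 paragraph"&gt;In a notebook, &lt;CODE&gt;start_query&lt;/CODE&gt; could be a function that starts the streaming query and then calls &lt;CODE&gt;awaitTermination()&lt;/CODE&gt; on it. Restarting this way assumes the query uses a checkpoint location, so that it resumes from where it stopped. Errors that do not match the heuristic are re-raised immediately rather than retried.&lt;/P&gt;
&lt;/LI&gt;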
&lt;/UL&gt;</description>
      <pubDate>Tue, 24 Jun 2025 14:19:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-error-listing-in-gcs/m-p/122695#M46842</guid>
      <dc:creator>Walter_C</dc:creator>
      <dc:date>2025-06-24T14:19:31Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Streaming Error Listing in GCS</title>
      <link>https://community.databricks.com/t5/data-engineering/spark-streaming-error-listing-in-gcs/m-p/122754#M46858</link>
      <description>&lt;P&gt;Thank you for your recommendation.&lt;/P&gt;&lt;P&gt;I found the &lt;STRONG&gt;fs.gs.client.type&lt;/STRONG&gt; configuration. I currently use the default, HTTP_API_CLIENT. If I change it to STORAGE_CLIENT, would that solve this issue?&lt;/P&gt;&lt;P&gt;According to the documentation, gRPC is an optimized way to connect to the GCS backend, offering better latency and increased bandwidth.&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Docs:&amp;nbsp;&lt;A href="https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md" target="_blank" rel="noopener"&gt;hadoop-connectors/gcs/CONFIGURATION.md at master · GoogleCloudDataproc/hadoop-connectors · GitHub&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 25 Jun 2025 03:39:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/spark-streaming-error-listing-in-gcs/m-p/122754#M46858</guid>
      <dc:creator>loinguyen3182</dc:creator>
      <dc:date>2025-06-25T03:39:39Z</dc:date>
    </item>
  </channel>
</rss>

