Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Spark Streaming Error Listing in GCS

loinguyen3182
New Contributor II

I have run into a problem where listing _delta_log fails when Spark reads a stream in Delta format from GCS.

 This is the full log of the issue:

org.apache.spark.sql.streaming.StreamingQueryException: Failed to get result: java.io.IOException: Error listing gs://<bucket_name>/bronze-layer/<database_name>/<table_name>/_delta_log/. reason=Connection closed prematurely: bytesRead = 199316, Content-Length = 1010876 with message : java.io.IOException: Error listing gs://<bucket_name>/bronze-layer/<database_name>/<table_name>/_delta_log/. reason=Connection closed prematurely: bytesRead = 199316, Content-Length = 1010876

Has anyone encountered this issue? I only see it occasionally. I have made sure the Databricks environment has the necessary permissions to access the GCS bucket, and the log retention interval is set to 7 days. This table is very large and has many log files in _delta_log.

How do I solve this issue and determine the root cause?
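
For context, a minimal sketch of the kind of streaming read described above; the paths use the same placeholders as the error message, and the sink and checkpoint locations are illustrative rather than taken from the actual job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder paths matching the error message above; the real bucket, database,
# and table names are not shown in this post.
source_path = "gs://<bucket_name>/bronze-layer/<database_name>/<table_name>"
sink_path = "gs://<bucket_name>/silver-layer/<database_name>/<table_name>"
checkpoint_path = "gs://<bucket_name>/_checkpoints/<table_name>"

# The listing error surfaces while this read enumerates files under _delta_log.
stream_df = spark.readStream.format("delta").load(source_path)

query = (
    stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .start(sink_path)
)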

 

2 REPLIES

Walter_C
Databricks Employee

The key contributing factors to this issue, according to internal investigations and customer tickets, include:

  • Large Number of Log Files in _delta_log: Delta Lake maintains a JSON transaction log that grows with every commit. The more files present, the larger the listing operations and the greater the risk of exceeding network or cloud storage client limitations during read operations. This risk is magnified in streaming scenarios, especially at scale.

  • Connection Closure During Metadata Listing: The underlying error (Connection closed prematurely ... bytesRead != Content-Length) points to a prematurely ended HTTP connection while downloading a large _delta_log listing result. This can be due to:

    • Network instability.
    • GCS forcibly closing idle or long-lived requests.
    • The Hadoop GCS connector or client SDK not retrying this particular error class automatically.
  • Client or Connector Limitations: The Databricks (and open source Hadoop) GCS connector does implement retries for many IO errors, but according to internal ticket discussions, certain cases, such as connection resets during listing, are not handled robustly, and jobs may fail unless manual retries are built in at a higher level.

Known Workarounds and Solutions

  • Implement Retry Logic at the Job Level: Since not all GCS IO failures are retried automatically, consider adding resilient retry or workflow restart mechanisms around your streaming job. Some customers have mitigated transient failures by catching and restarting failed jobs or adding orchestration-level retries (see the retry sketch after this list).

  • Reduce _delta_log File Count (see the maintenance sketch after this list):

    • Run Delta Lake OPTIMIZE and VACUUM periodically to compact files (though note that VACUUM only cleans up obsolete data files, not the delta logs themselves).
    • Lower the checkpoint interval (delta.checkpointInterval) so checkpoints are created more frequently, allowing Spark to skip many JSON logs and list/parse fewer files during state reconstruction.
    • Tune the retention intervals for logs and checkpoints (delta.logRetentionDuration, delta.checkpointRetentionDuration); shorter values allow more aggressive cleanup, but proceed with caution and make sure long-running queries or time travel are not needed beyond those intervals.
  • Monitor for GCS-Side Issues: Coordinate with your cloud team to check GCS logs for bandwidth throttling, idle timeout, or TCP connection limitation indicators.
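
On the first bullet (job-level retries), a minimal sketch of a restart wrapper, assuming a Python job that owns the streaming query; the paths, limits, and the error-message check are illustrative and not a built-in Databricks feature:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder locations, as in the original error message.
source_path = "gs://<bucket_name>/bronze-layer/<database_name>/<table_name>"
sink_path = "gs://<bucket_name>/silver-layer/<database_name>/<table_name>"
checkpoint_path = "gs://<bucket_name>/_checkpoints/<table_name>"

MAX_RESTARTS = 5        # example limit before giving up and alerting
BACKOFF_SECONDS = 60    # example delay between restarts

def start_stream():
    return (
        spark.readStream.format("delta").load(source_path)
        .writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .start(sink_path)
    )

restarts = 0
while True:
    query = start_stream()
    try:
        query.awaitTermination()   # blocks until the query stops or fails
        break                      # clean stop: leave the loop
    except Exception as e:         # StreamingQueryException in practice; caught broadly to stay version-agnostic
        # Restart only on the transient GCS listing failure; re-raise anything else.
        if "Error listing gs://" in str(e) and restarts < MAX_RESTARTS:
            restarts += 1
            time.sleep(BACKOFF_SECONDS)   # back off, then resume from the checkpoint
            continue
        raise

Orchestration-level retries (for example, task retries on a Databricks Workflows job) achieve the same effect without custom code.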
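
And on reducing the _delta_log footprint, a hedged sketch of the maintenance commands and table properties mentioned above; the table identifier and values are examples only, so pick retention values that match your time-travel needs:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

table = "bronze.<table_name>"   # placeholder table identifier

# Compact small data files; VACUUM removes obsolete data files, not the _delta_log JSONs.
spark.sql(f"OPTIMIZE {table}")
spark.sql(f"VACUUM {table} RETAIN 168 HOURS")   # 7 days, example value

# Example properties: checkpoint every 10 commits (the default) and keep 7 days of
# JSON log history so expired entries can be cleaned up when checkpoints are written.
spark.sql(f"""
    ALTER TABLE {table} SET TBLPROPERTIES (
        'delta.checkpointInterval' = '10',
        'delta.logRetentionDuration' = 'interval 7 days',
        'delta.checkpointRetentionDuration' = 'interval 2 days'
    )
""")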

Thank you for your recommendation,

I found the fs.gs.client.type configuration. I currently use the default, HTTP_API_CLIENT. If I change it to STORAGE_CLIENT, can this issue be solved?

I see in the documentation that gRPC is an optimized way to connect to the GCS backend; it offers better latency and increased bandwidth.

Docs: hadoop-connectors/gcs/CONFIGURATION.md at master · GoogleCloudDataproc/hadoop-connectors · GitHub
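
For anyone trying this, a hedged sketch of how the switch could be tested: the property name comes from the CONFIGURATION.md linked above, whether STORAGE_CLIENT actually avoids the listing error is unverified, and support depends on the connector version bundled with the runtime.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Preferably set in the cluster's Spark config so the connector picks it up at startup:
#   spark.hadoop.fs.gs.client.type STORAGE_CLIENT
#
# Notebook-scoped equivalent for a quick experiment (uses a private handle and only
# affects GCS filesystem instances created after this point):
spark.sparkContext._jsc.hadoopConfiguration().set("fs.gs.client.type", "STORAGE_CLIENT")

print(spark.sparkContext._jsc.hadoopConfiguration().get("fs.gs.client.type"))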
