topic Re: Spark Streaming Error Listing in GCS in Data Engineering

Spark Streaming Error Listing in GCS

loinguyen3182 — Fri, 20 Jun 2025 06:55:53 GMT

I have faced a problem about error listing of _delta_log, when the spark read stream with delta format in GCS.

This is the full log of the issue:

org.apache.spark.sql.streaming.StreamingQueryException: Failed to get result: java.io.IOException: Error listing gs://<bucket_name>/bronze-layer/<database_name>/<table_name>/_delta_log/. reason=Connection closed prematurely: bytesRead = 199316, Content-Length = 1010876 with message : java.io.IOException: Error listing gs://<bucket_name>/bronze-layer/<database_name>/<table_name>/_delta_log/. reason=Connection closed prematurely: bytesRead = 199316, Content-Length = 1010876

Has anyone encountered this issue? I only see it occasionally and not often., I ensure that the databricks environment grants the necessary permissions to access the GCS bucket and set log retention interval 7 days. Currently, this table is really large and have many log files in _delta_log.

On this issue, how do I solve it, and determine the root cause?

Re: Spark Streaming Error Listing in GCS

Walter_C — Tue, 24 Jun 2025 14:19:31 GMT

The key contributing factors to this issue, according to internal investigations and customer tickets, include:

Large Number of Log Files in _delta_log: Delta Lake maintains a JSON transaction log that grows with every commit. The more files present, the larger the listing operations and the greater the risk of exceeding network or cloud storage client limitations during read operations. This risk is magnified in streaming scenarios, especially at scale
Connection Closure During Metadata Listing: The underlying error (Connection closed prematurely ... bytesRead != Content-Length) points to a prematurely ended HTTP connection while downloading a large _delta_log listing result. This can be due to:
- Network instability.
- GCS forcibly closing idle or long-lived requests.
- The Hadoop GCS connector or client SDK not retrying this particular error class automatically
Client or Connector Limitations: The Databricks (and open source Hadoop) GCS connector does implement retries for many IO errors, but according to internal ticket discussions, certain cases—such as connection resets during listing—are not handled robustly, and jobs may fail unless manual retries are built in at a higher level.

Known Workarounds and Solutions

Implement Retry Logic at the Job Level: Since not all GCS IO failures are retried automatically, consider adding resilient retry or workflow restart mechanisms around your streaming job. Some customers have mitigated transient failures by catching and restarting failed jobs or adding orchestration-level retries
Reduce _delta_log File Count:
- Run Delta Lake OPTIMIZE and VACUUM periodically to compact files (though note that vacuum only cleans obsolete data files—not delta logs themselves).
- Lower the checkpoint interval (delta.checkpointInterval) so checkpoints are created more frequently, allowing Spark to skip many JSON logs and list/parse fewer files during state reconstruction.
- Increase retention intervals for logs and checkpoints (delta.logRetentionDuration, delta.checkpointRetentionDuration) to allow more aggressive cleanup, but with caution—ensure long-running queries or time travel is not needed beyond these intervals
Monitor for GCS-Side Issues: Coordinate with your cloud team to check GCS logs for bandwidth throttling, idle timeout, or TCP connection limitation indicators.

Re: Spark Streaming Error Listing in GCS

loinguyen3182 — Wed, 25 Jun 2025 03:39:39 GMT

Thank you for your recommendation,

I found a configuration of fs.gs.client.type, I current use the default, that is HTTP_API_CLIENT, If I change it to use STORAGE_CLIENT. Can this issue be solve?

I see in this document, that is gRPC is an optimized way to connect with gcs backend. It offers better latency and increased bandwidth.

Docs: hadoop-connectors/gcs/CONFIGURATION.md at master · GoogleCloudDataproc/hadoop-connectors · GitHub