S3 connection reset error :: Removing Spark Config on Cluster

okmich — Tue, 20 Jul 2021 07:45:59 GMT

Hi guys,

I am running a production pipeline (Databricks Runtime 7.3 LTS) that keeps failing on some Delta file reads with this error:

21/07/19 09:56:02 ERROR Executor: Exception in task 36.1 in stage 2.0 (TID 58)
com.databricks.sql.io.FileReadException: Error while reading file dbfs:/delta/dbname/tablename/part-00002-6df5def6-4670-4522-bed9-bcef79a172bc-c000.snappy.parquet.
at org.apache.spark.sql.execution.datasources.FileScanRDD$anon$1$anon$2.logFileNameAndThrow(FileScanRDD.scala:347)
at org.apache.spark.sql.execution.datasources.FileScanRDD$anon$1$anon$2.getNext(FileScanRDD.scala:326)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.sql.execution.datasources.FileScanRDD$anon$1.hasNext(FileScanRDD.scala:258)
at org.apache.spark.sql.execution.FileSourceScanExec$anon$1.hasNext(DataSourceScanExec.scala:716)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$anon$1.hasNext(WholeStageCodegenExec.scala:733)
at scala.collection.Iterator$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:2008)
at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1234)
at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1234)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2379)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
at org.apache.spark.scheduler.Task.run(Task.scala:117)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:640)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:643)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: javax.net.ssl.SSLException: Connection reset; Request ID: BABR4P3PP4X21SWG, Extended Request ID: SxgYnGm6XJNalP0H2c339Kq4/H7N2P8x09C/GxxMHnNwdGCnhyPlQv15SLRJ+eALsIEKRvvcbvg=, Cloud Provider: AWS, Instance ID: i-0a9a161dac10f903a
at sun.security.ssl.Alert.createSSLException(Alert.java:127)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:348)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:291)
at sun.security.ssl.TransportContext.fatal(TransportContext.java:286)

This error is strange because it does not occur when I run the same spark.read operation on the same dataset from a notebook; it only occurs when the read runs as part of a job. The stack trace shows a SparkException caused by an SSLException, which is in turn caused by a SocketException.

It turns out that a similar issue appears to have been documented last week in the Databricks Knowledge Base: https://kb.databricks.com/dbfs/s3-connection-reset-error.html

My question, therefore, is: how do I remove the following Spark config entries, as instructed in that knowledge base article?

spark.hadoop.fs.s3.impl com.databricks.s3a.S3AFileSystem
spark.hadoop.fs.s3n.impl com.databricks.s3a.S3AFileSystem
spark.hadoop.fs.s3a.impl com.databricks.s3a.S3AFileSystem
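
To make the question concrete, here is a minimal sketch of what I understand "removing" to mean: dropping those three keys from the cluster's Spark config text. It assumes the config is plain text with one whitespace-separated key/value pair per line, as in the three lines above. The helper name strip_s3_impl_overrides and the sample config are hypothetical, made up for illustration, and this is not taken from the article.

```python
# Keys named in the knowledge base article (copied from the lines above).
KEYS_TO_REMOVE = {
    "spark.hadoop.fs.s3.impl",
    "spark.hadoop.fs.s3n.impl",
    "spark.hadoop.fs.s3a.impl",
}


def strip_s3_impl_overrides(spark_conf_text):
    """Return the config text without the lines that set KEYS_TO_REMOVE."""
    kept = []
    for line in spark_conf_text.splitlines():
        parts = line.split(None, 1)  # first token is the key, the rest is the value
        if parts and parts[0] in KEYS_TO_REMOVE:
            continue  # skip the overrides the article says to remove
        kept.append(line)
    return "\n".join(kept)


# Hypothetical cluster config, for illustration only.
sample = """spark.hadoop.fs.s3a.impl com.databricks.s3a.S3AFileSystem
spark.sql.shuffle.partitions 200
spark.hadoop.fs.s3n.impl com.databricks.s3a.S3AFileSystem"""

print(strip_s3_impl_overrides(sample))
```

If that is the right idea, I would then paste the filtered text back into the cluster's Spark config and restart the cluster, but I am not sure that is where these settings actually live.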

Any further information on that article will be appreciated.

Regards,
