
Error: TransportResponseHandler: Still have 1 requests outstanding when connection is closed, occurring only on large datasets.

SatheesshChinnu
New Contributor III

I am getting the error below only with a large dataset (about 15 TB compressed). With a smaller dataset (about 1 TB) I do not get this error.

It looks like the job fails at the shuffle stage. The approximate number of mappers is 150,000.

Spark config:

spark.sql.warehouse.dir hdfs:///user/spark/warehouse
spark.yarn.dist.files file:/etc/spark/conf/hive-site.xml
spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.driver.host 172.20.103.94
spark.history.fs.logDirectory hdfs:///var/log/spark/apps
spark.eventLog.enabled true
spark.ui.port 0
spark.driver.port 35246
spark.shuffle.service.enabled true
spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.yarn.historyServer.address ip-172-20-99-29.ec2.internal:18080
spark.yarn.app.id application_1486842541319_0002
spark.scheduler.mode FIFO
spark.driver.memory 10g
spark.executor.id driver
spark.yarn.app.container.log.dir /var/log/hadoop-yarn/containers/application_1486842541319_0002/container_1486842541319_0002_01_000001
spark.driver.extraJavaOptions -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.submit.deployMode cluster
spark.master yarn
spark.ui.filters org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.sql.hive.metastore.sharedPrefixes com.amazonaws.services.dynamodbv2
spark.executor.memory 5120M
spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*
spark.eventLog.dir hdfs:///var/log/spark/apps
spark.dynamicAllocation.enabled true
spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*
spark.executor.cores 8
spark.history.ui.port 18080
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS ip-172-20-99-29.ec2.internal
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES http://ip-172-20-99-29.ec2.internal:20888/proxy/application_1486842541319_0002
spark.app.id application_1486842541319_0002
spark.hadoop.yarn.timeline-service.enabled false
spark.sql.shuffle.partitions 10000

Error Trace:

17/02/11 22:01:05 INFO ShuffleBlockFetcherIterator: Started 29 remote fetches in 2700 ms
17/02/11 22:03:04 ERROR TransportChannelHandler: Connection to ip-172-20-96-109.ec2.internal/172.20.96.109:7337 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
17/02/11 22:03:04 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from ip-172-20-96-109.ec2.internal/172.20.96.109:7337 is closed
17/02/11 22:03:04 ERROR OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection from ip-172-20-96-109.ec2.internal/172.20.96.109:7337 closed
    at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:128)
    at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:109)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:257)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:745)
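
The TransportChannelHandler message above already points at the first knob: spark.network.timeout. As a minimal sketch, assuming a PySpark entry point (the original post shows no code), the timeout and the shuffle fetch retry settings could be raised when the session is created; the application name and the values below are illustrative assumptions, not the poster's settings:

# Sketch only: illustrative values, not the configuration from the original post.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("large-shuffle-job")                  # hypothetical application name
    .config("spark.network.timeout", "600s")       # default is 120s, matching the 120000 ms in the log above
    .config("spark.shuffle.io.maxRetries", "10")   # default 3; retry failed shuffle block fetches more times
    .config("spark.shuffle.io.retryWait", "30s")   # default 5s; wait longer between retries
    .getOrCreate()
)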

1 ACCEPTED SOLUTION


SatheesshChinnu
New Contributor III

I increased the timeout to 1200s (i.e., spark.network.timeout=1200s), but I am still getting the Netty error. This time the error occurs during block replication.

17/02/24 09:10:21 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from ip-172-20-101-120.ec2.internal/172.20.101.120:46113 is closed
17/02/24 09:10:21 ERROR NettyBlockTransferService: Error while uploading block rdd_24_2312
java.io.IOException: Connection from ip-172-20-101-120.ec2.internal/172.20.101.120:46113 closed
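
Since raising the timeout alone did not resolve it here, another option, which is an assumption on my part and not something confirmed in this thread, is to reduce how much shuffle data each reduce task requests at once, so individual fetches are more likely to finish within the timeout. Both properties below are standard Spark settings, but the values are illustrative:

# Sketch only: illustrative mitigation, not a fix confirmed in this thread.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.network.timeout", "1200s")        # the value the poster already tried
    .config("spark.reducer.maxSizeInFlight", "24m")  # default 48m; request less shuffle data per fetch
    .config("spark.reducer.maxReqsInFlight", "64")   # cap concurrent fetch requests per reduce task
    .getOrCreate()
)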


4 REPLIES

gang_liugang_li
New Contributor II

I have also encountered this problem. My guess is that during the shuffle the network bandwidth reaches its limit and the connection times out. I solved the problem by reducing the number of executors.
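
The original configuration has dynamic allocation enabled, so one way to apply this suggestion, again assuming a PySpark entry point, is to cap how far the job can scale out; the cap of 100 executors below is purely illustrative:

# Sketch only: capping executors to limit aggregate shuffle/network pressure.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.shuffle.service.enabled", "true")        # required for dynamic allocation, as in the original config
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "100")  # arbitrary cap; tune to the cluster's network capacity
    .getOrCreate()
)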

srikanthvvgs
New Contributor II

I am facing the same issue. I am doing a shuffle using a groupByKey operation with very few connectors, and I am seeing the connection closed error from one of the nodes.
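
If the shuffle is driven by groupByKey, a common way to shrink the shuffled volume, shown here on a tiny hypothetical RDD since the actual job is not posted, is to aggregate map-side with reduceByKey (or an equivalent DataFrame aggregation):

# Sketch only: tiny hypothetical (key, value) data; the poster's actual job is not shown.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("groupByKey-vs-reduceByKey").getOrCreate()
pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

# groupByKey ships every individual value across the shuffle:
grouped_totals = pairs.groupByKey().mapValues(sum)

# reduceByKey combines values on the map side first, so far less data crosses the network:
reduced_totals = pairs.reduceByKey(lambda a, b: a + b)

print(sorted(reduced_totals.collect()))  # [('a', 4), ('b', 2)]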

parikshitbhoyar
New Contributor II

@Satheessh Chinnusamy, how did you solve the above issue?
