Error: TransportResponseHandler: Still have 1 requests outstanding when connection, occurring only on large dataset.

SatheesshChinnu
New Contributor III

I am getting the error below only with a large dataset (i.e. 15 TB compressed). If my dataset is small (1 TB), I do not get this error.

It looks like it fails at the shuffle stage. The approximate number of mappers is 150,000.

Spark config:

spark.sql.warehouse.dir hdfs:///user/spark/warehouse
spark.yarn.dist.files file:/etc/spark/conf/hive-site.xml
spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.driver.host 172.20.103.94
spark.history.fs.logDirectory hdfs:///var/log/spark/apps
spark.eventLog.enabled true
spark.ui.port 0
spark.driver.port 35246
spark.shuffle.service.enabled true
spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.yarn.historyServer.address ip-172-20-99-29.ec2.internal:18080
spark.yarn.app.id application_1486842541319_0002
spark.scheduler.mode FIFO
spark.driver.memory 10g
spark.executor.id driver
spark.yarn.app.container.log.dir /var/log/hadoop-yarn/containers/application_1486842541319_0002/container_1486842541319_0002_01_000001
spark.driver.extraJavaOptions -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.submit.deployMode cluster
spark.master yarn
spark.ui.filters org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.sql.hive.metastore.sharedPrefixes com.amazonaws.services.dynamodbv2
spark.executor.memory 5120M
spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*
spark.eventLog.dir hdfs:///var/log/spark/apps
spark.dynamicAllocation.enabled true
spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*
spark.executor.cores 8
spark.history.ui.port 18080
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS ip-172-20-99-29.ec2.internal
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES http://ip-172-20-99-29.ec2.internal:20888/proxy/application_1486842541319_0002
spark.app.id application_1486842541319_0002
spark.hadoop.yarn.timeline-service.enabled false
spark.sql.shuffle.partitions 10000

Error Trace:

17/02/11 22:01:05 INFO ShuffleBlockFetcherIterator: Started 29 remote fetches in 2700 ms
17/02/11 22:03:04 ERROR TransportChannelHandler: Connection to ip-172-20-96-109.ec2.internal/172.20.96.109:7337 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
17/02/11 22:03:04 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from ip-172-20-96-109.ec2.internal/172.20.96.109:7337 is closed
17/02/11 22:03:04 ERROR OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection from ip-172-20-96-109.ec2.internal/172.20.96.109:7337 closed
    at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:128)
    at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:109)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:257)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
    at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
    at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
    at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:745)
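
For reference, the log message above names the knob it wants adjusted. Here is a minimal sketch (Scala; the values and app name are illustrative, not a recommendation) of how spark.network.timeout and the shuffle fetch retry settings can be raised before the job starts, either in the session builder or as --conf flags on spark-submit:

import org.apache.spark.sql.SparkSession

// Sketch only: spark.network.timeout is the idle timeout from the error above;
// spark.shuffle.io.maxRetries / spark.shuffle.io.retryWait control how many
// times a failed shuffle fetch is retried, and how long to wait, before the
// stage fails. These must be set at launch time, not changed mid-job.
val spark = SparkSession.builder()
  .appName("large-shuffle-job")                 // hypothetical app name
  .config("spark.network.timeout", "600s")
  .config("spark.shuffle.io.maxRetries", "10")
  .config("spark.shuffle.io.retryWait", "30s")
  .getOrCreate()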

4 REPLIES

SatheesshChinnu
New Contributor III (Accepted Solution)

I increased my timeout to 1200s (i.e. spark.network.timeout=1200s). I am still getting the Netty error; this time it occurs during block replication.

17/02/24 09:10:21 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from ip-172-20-101-120.ec2.internal/172.20.101.120:46113 is closed
17/02/24 09:10:21 ERROR NettyBlockTransferService: Error while uploading block rdd_24_2312
java.io.IOException: Connection from ip-172-20-101-120.ec2.internal/172.20.101.120:46113 closed
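
Since the reply above traces the remaining failure to block replication, here is a hedged sketch (Scala; the RDD name and input path are hypothetical) of where those "uploading block rdd_..." transfers come from: an RDD persisted with a replicated (*_2) storage level pushes each cached block to a second executor over the network, while a single-copy level keeps the block local.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Sketch only, with illustrative names and paths.
val spark = SparkSession.builder().appName("replication-sketch").getOrCreate()
val someRdd = spark.sparkContext.textFile("hdfs:///data/events") // hypothetical input

// Replicated cache: two copies, so every cached block is uploaded to a peer executor.
//   someRdd.persist(StorageLevel.MEMORY_AND_DISK_2)

// Single-copy cache: no peer upload for the cached blocks.
someRdd.persist(StorageLevel.MEMORY_AND_DISK)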

gang_liugang_li
New Contributor II

I have also encountered this problem. My guess is that during the shuffle the network bandwidth reaches its limit and the connection times out. I solved it by reducing the number of executors; a sketch of that approach follows.
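
A hedged sketch of that approach (Scala; the cap value is illustrative), assuming dynamic allocation stays on as in the posted config:

import org.apache.spark.sql.SparkSession

// Sketch only. With spark.dynamicAllocation.enabled=true (as in the posted
// config), maxExecutors caps how many executors fetch shuffle blocks at once;
// with dynamic allocation off, spark.executor.instances (or --num-executors
// on spark-submit) fixes the count directly.
val spark = SparkSession.builder()
  .appName("capped-executors")                            // hypothetical app name
  .config("spark.dynamicAllocation.maxExecutors", "100")  // illustrative cap
  .getOrCreate()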

srikanthvvgs
New Contributor II

I am facing the same issue. I am doing a shuffle with a group-by-key operation on very few connectors, and I get the connection closed error from one of the nodes.
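
Not from this thread, but when the shuffle behind a group-by-key is what saturates the connections, a common mitigation is to combine values map-side so less data crosses the network. A minimal RDD sketch (Scala; path and key layout are hypothetical):

import org.apache.spark.sql.SparkSession

// Sketch with illustrative names: count records per key.
val spark = SparkSession.builder().appName("map-side-combine").getOrCreate()
val pairs = spark.sparkContext
  .textFile("hdfs:///data/events")          // hypothetical input
  .map(line => (line.split(",")(0), 1L))    // (key, 1) pairs

// groupByKey ships every value to the reducer before aggregating:
//   pairs.groupByKey().mapValues(_.sum)
// reduceByKey combines per partition first, so far less data is shuffled:
val counts = pairs.reduceByKey(_ + _)
counts.take(10).foreach(println)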

parikshitbhoyar
New Contributor II

@Satheessh Chinnusamy, how did you solve the above issue?
