02-11-2017 05:34 PM
I am getting the error below only with a large dataset (i.e. 15 TB compressed); with a small dataset (1 TB) I do not get this error.
It looks like the job fails at the shuffle stage. The approximate number of mappers is 150,000.
Spark config:
spark.sql.warehouse.dir hdfs:///user/spark/warehouse
spark.yarn.dist.files file:/etc/spark/conf/hive-site.xml
spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.driver.host 172.20.103.94
spark.history.fs.logDirectory hdfs:///var/log/spark/apps
spark.eventLog.enabled true
spark.ui.port 0
spark.driver.port 35246
spark.shuffle.service.enabled true
spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.yarn.historyServer.address ip-172-20-99-29.ec2.internal:18080
spark.yarn.app.id application_1486842541319_0002
spark.scheduler.mode FIFO
spark.driver.memory 10g
spark.executor.id driver
spark.yarn.app.container.log.dir /var/log/hadoop-yarn/containers/application_1486842541319_0002/container_1486842541319_0002_01_000001
spark.driver.extraJavaOptions -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.submit.deployMode cluster
spark.master yarn
spark.ui.filters org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.sql.hive.metastore.sharedPrefixes com.amazonaws.services.dynamodbv2
spark.executor.memory 5120M
spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*
spark.eventLog.dir hdfs:///var/log/spark/apps
spark.dynamicAllocation.enabled true
spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*
spark.executor.cores 8
spark.history.ui.port 18080
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS ip-172-20-99-29.ec2.internal
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES http://ip-172-20-99-29.ec2.internal:20888/proxy/application_1486842541319_0002
spark.app.id application_1486842541319_0002
spark.hadoop.yarn.timeline-service.enabled false
spark.sql.shuffle.partitions 10000
Error Trace:
17/02/11 22:01:05 INFO ShuffleBlockFetcherIterator: Started 29 remote fetches in 2700 ms
17/02/11 22:03:04 ERROR TransportChannelHandler: Connection to ip-172-20-96-109.ec2.internal/172.20.96.109:7337 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
17/02/11 22:03:04 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from ip-172-20-96-109.ec2.internal/172.20.96.109:7337 is closed
17/02/11 22:03:04 ERROR OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection from ip-172-20-96-109.ec2.internal/172.20.96.109:7337 closed
at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:128)
at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:109)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:257)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
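For reference, the timeout named in the log ("quiet for 120000 ms ... please adjust spark.network.timeout") is controlled by standard Spark configuration keys. A minimal sketch of setting them when building the session follows; the property names are real Spark settings, but the values are illustrative assumptions rather than anything confirmed in this thread.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch (not from this thread): shuffle/network settings commonly raised when
// remote shuffle fetches time out under load. Values are illustrative assumptions.
val spark = SparkSession.builder()
  .appName("shuffle-timeout-tuning")
  .config("spark.network.timeout", "600s")        // default 120s, matching the "quiet for 120000 ms" message
  .config("spark.shuffle.io.maxRetries", "10")    // retry a failed shuffle fetch instead of failing the task
  .config("spark.shuffle.io.retryWait", "30s")    // wait between fetch retries
  .config("spark.reducer.maxSizeInFlight", "24m") // shrink the per-reducer in-flight fetch buffer
  .getOrCreate()
```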
- Labels: Spark
Accepted Solutions
02-24-2017 01:27 AM
I have increased my timeout to 1200s (i.e. spark.network.timeout=1200s). I am still getting the Netty error; this time it occurs during block replication.
17/02/24 09:10:21 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from ip-172-20-101-120.ec2.internal/172.20.101.120:46113 is closed
17/02/24 09:10:21 ERROR NettyBlockTransferService: Error while uploading block rdd_24_2312
java.io.IOException: Connection from ip-172-20-101-120.ec2.internal/172.20.101.120:46113 closed
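The "Error while uploading block rdd_24_2312" path is only taken when a cached block has to be replicated to a peer executor. Assuming (the thread does not show the caching code) that the RDD is persisted with a replicated storage level such as MEMORY_AND_DISK_2, switching to a single-replica level removes that upload entirely; a minimal sketch with hypothetical names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Sketch under an assumption: `spark` and `rdd` are hypothetical stand-ins for the
// real job; the point is only the storage level.
val spark = SparkSession.builder().appName("replication-sketch").master("local[*]").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 1000)

rdd.persist(StorageLevel.MEMORY_AND_DISK) // single replica: no peer-to-peer block upload
rdd.count()                               // materialize the cache
```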
11-20-2017 10:29 PM
I have also encountered this problem. My guess is that during the shuffle the network bandwidth reaches its limit and the connection times out. I solved the problem by reducing the number of executors.
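A minimal sketch of one way to do this: since the original configuration has spark.dynamicAllocation.enabled set to true, the executor count can be capped through dynamic allocation. The cap of 100 below is an illustrative assumption, not a value from this thread.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: cap the executor count under dynamic allocation so fewer executors fetch
// shuffle blocks concurrently and compete for network bandwidth.
val spark = SparkSession.builder()
  .appName("executor-cap-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.maxExecutors", "100") // illustrative cap
  .getOrCreate()
```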
05-01-2018 04:19 PM
I am facing the same issue. I am doing a shuffle with a group-by-key operation using very few connectors, and I get the connection-closed error from one of the nodes.
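Not a suggestion made elsewhere in this thread, but a common way to shrink the shuffle produced by a group-by-key aggregation is to pre-combine on the map side with reduceByKey; a minimal sketch with hypothetical data:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: reduceByKey combines values per key on the map side before the shuffle,
// so far less data crosses the network than groupByKey followed by a per-key reduce.
val spark = SparkSession.builder().appName("pre-aggregate-sketch").master("local[*]").getOrCreate()
val pairs = spark.sparkContext.parallelize(Seq(("a", 1L), ("b", 2L), ("a", 3L)))

val counts = pairs.reduceByKey(_ + _) // instead of pairs.groupByKey().mapValues(_.sum)
counts.collect().foreach(println)
```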
09-03-2018 02:20 AM
@Satheessh Chinnusamy how did you solve the above issue?

