02-11-2017 05:34 PM
I am getting the error below only with a large dataset (about 15 TB compressed); with a small dataset (about 1 TB) the error does not occur.
It looks like the job fails during the shuffle stage. The approximate number of mappers is 150,000.
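With roughly 150,000 map tasks, every reducer has to fetch blocks from a very large number of shuffle files, which puts heavy load on the shuffle service. A minimal sketch of one way to cut the map-task count before the wide operation, assuming the input is Parquet and read as a DataFrame (the path, key column, and target partition count below are placeholders, not values from this job):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Coalesce the input so the shuffle has fewer map tasks; fewer map tasks
// means fewer shuffle blocks for each reducer to fetch.
val df = spark.read.parquet("s3://bucket/input/")  // placeholder path
val result = df
  .coalesce(20000)        // illustrative target; tune to the cluster size
  .groupBy("key")         // placeholder key column
  .count()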
Spark config:
spark.sql.warehouse.dir hdfs:///user/spark/warehouse
spark.yarn.dist.files file:/etc/spark/conf/hive-site.xml
spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.driver.host 172.20.103.94
spark.history.fs.logDirectory hdfs:///var/log/spark/apps
spark.eventLog.enabled true
spark.ui.port 0
spark.driver.port 35246
spark.shuffle.service.enabled true
spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.yarn.historyServer.address ip-172-20-99-29.ec2.internal:18080
spark.yarn.app.id application_1486842541319_0002
spark.scheduler.mode FIFO
spark.driver.memory 10g
spark.executor.id driver
spark.yarn.app.container.log.dir /var/log/hadoop-yarn/containers/application_1486842541319_0002/container_1486842541319_0002_01_000001
spark.driver.extraJavaOptions -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError='kill -9 %p'
spark.submit.deployMode cluster
spark.master yarn
spark.ui.filters org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.sql.hive.metastore.sharedPrefixes com.amazonaws.services.dynamodbv2
spark.executor.memory 5120M
spark.driver.extraClassPath /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*
spark.eventLog.dir hdfs:///var/log/spark/apps
spark.dynamicAllocation.enabled true
spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*
spark.executor.cores 8
spark.history.ui.port 18080
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_HOSTS ip-172-20-99-29.ec2.internal
spark.org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.param.PROXY_URI_BASES http://ip-172-20-99-29.ec2.internal:20888/proxy/application_1486842541319_0002
spark.app.id application_1486842541319_0002
spark.hadoop.yarn.timeline-service.enabled false
spark.sql.shuffle.partitions 10000
Error Trace:
17/02/11 22:01:05 INFO ShuffleBlockFetcherIterator: Started 29 remote fetches in 2700 ms
17/02/11 22:03:04 ERROR TransportChannelHandler: Connection to ip-172-20-96-109.ec2.internal/172.20.96.109:7337 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
17/02/11 22:03:04 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from ip-172-20-96-109.ec2.internal/172.20.96.109:7337 is closed
17/02/11 22:03:04 ERROR OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection from ip-172-20-96-109.ec2.internal/172.20.96.109:7337 closed
at org.apache.spark.network.client.TransportResponseHandler.channelInactive(TransportResponseHandler.java:128)
at org.apache.spark.network.server.TransportChannelHandler.channelInactive(TransportChannelHandler.java:109)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.handler.timeout.IdleStateHandler.channelInactive(IdleStateHandler.java:257)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
at io.netty.channel.ChannelInboundHandlerAdapter.channelInactive(ChannelInboundHandlerAdapter.java:75)
at org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:182)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:230)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1289)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:251)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:237)
at io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:893)
at io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:691)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
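The first ERROR line above explicitly suggests adjusting spark.network.timeout. A minimal sketch of how that timeout and the related shuffle-fetch retry settings could be raised when building the session (the values are illustrative, not tested recommendations for this workload):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("large-shuffle-job")                  // placeholder app name
  .config("spark.network.timeout", "600s")       // default is 120s, matching the log
  .config("spark.shuffle.io.maxRetries", "10")   // default is 3
  .config("spark.shuffle.io.retryWait", "30s")   // default is 5s
  .getOrCreate()

Longer retry counts and waits let a transient fetch failure recover instead of failing the task, at the cost of slower detection of genuinely dead nodes.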
02-24-2017 01:27 AM
I have increased the timeout to 1200s (i.e. spark.network.timeout=1200s), but I am still getting the netty error. This time it occurs during block replication.
17/02/24 09:10:21 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from ip-172-20-101-120.ec2.internal/172.20.101.120:46113 is closed
17/02/24 09:10:21 ERROR NettyBlockTransferService: Error while uploading block rdd_24_2312
java.io.IOException: Connection from ip-172-20-101-120.ec2.internal/172.20.101.120:46113 closed
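The failing path here is a block upload for replication, which only happens when an RDD is persisted with a replicated storage level such as MEMORY_AND_DISK_2. A minimal sketch of persisting without the replica, assuming the job does use a replicated level (rdd is a placeholder for whatever RDD is being cached):

import org.apache.spark.storage.StorageLevel

// Persisting without the "_2" replica avoids the cross-node block upload
// that is timing out; recomputation on failure replaces replication.
rdd.persist(StorageLevel.MEMORY_AND_DISK)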
11-20-2017 10:29 PM
I have also encountered this problem. My guess is that during the shuffle the network bandwidth hits its limit and the connection times out. I solved the problem by reducing the number of executors.
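For reference, since the config above has spark.dynamicAllocation.enabled true, one way to express "fewer executors" is to cap dynamic allocation rather than disable it. A minimal sketch (the cap of 100 is illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.maxExecutors", "100")  // illustrative cap
  .getOrCreate()

Fewer concurrent executors means fewer simultaneous connections to each shuffle service, which can keep fetch traffic under the network's limit.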
05-01-2018 04:19 PM
I am facing the same issue. I am doing the shuffle with a groupByKey operation, with very few executors, and I get the connection-closed error from one of the nodes.
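If the aggregation allows it, reduceByKey shuffles far less data than groupByKey, because values are combined map-side before the shuffle. A minimal sketch, assuming key-value pairs with numeric values (pairs is a placeholder RDD):

// reduceByKey pre-aggregates on each mapper, so only one partial sum per
// key per partition crosses the network, instead of every raw value.
val sums = pairs.reduceByKey(_ + _)   // instead of pairs.groupByKey().mapValues(_.sum)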
09-03-2018 02:20 AM
@Satheessh Chinnusamy, how did you solve the above issue?