<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Notebook failing in job-cluster but runs fine in all-purpose-cluster with the same configuration in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/notebook-failing-in-job-cluster-but-runs-fine-in-all-purpose/m-p/12200#M7049</link>
    <description>Notebook failing in job-cluster but runs fine in all-purpose-cluster with the same configuration (full post and error details in the first item below).</description>
    <pubDate>Thu, 28 Oct 2021 11:22:49 GMT</pubDate>
    <dc:creator>AjayHN</dc:creator>
    <dc:date>2021-10-28T11:22:49Z</dc:date>
    <item>
      <title>Notebook failing in job-cluster but runs fine in all-purpose-cluster with the same configuration</title>
      <link>https://community.databricks.com/t5/data-engineering/notebook-failing-in-job-cluster-but-runs-fine-in-all-purpose/m-p/12200#M7049</link>
      <description>&lt;P&gt;I have a notebook with many join and a few persist operations. It runs fine on an all-purpose-cluster (worker nodes: i3.xlarge, autoscale enabled), but the same notebook fails on a job-cluster with the same cluster definition (in fact, the job-cluster has even larger worker nodes: i3.8xlarge).&lt;/P&gt;&lt;P&gt;Cluster Conf:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="job-cluster"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2358i448CCDA4B30CC135/image-size/large?v=v2&amp;amp;px=999" role="button" title="job-cluster" alt="job-cluster" /&gt;&lt;/span&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="all-purpose-cluster"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2360i382A1620C63B5D35/image-size/large?v=v2&amp;amp;px=999" role="button" title="all-purpose-cluster" alt="all-purpose-cluster" /&gt;&lt;/span&gt;Spark Conf:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.databricks.delta.optimizeWrite.enabled true
spark.databricks.adaptive.autoOptimizedShuffle.enabled true&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Error:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 69 (sql at command-3296064203992845:4) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Connection reset by peer 	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:749) 	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:662) 	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:69) 	at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) 	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) 	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) 	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) 	at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:240) 	at org.apache.spark.sql.execution.SortExec$$anon$2.sortedIterator(SortExec.scala:133) 	at org.apache.spark.sql.execution.SortExec$$anon$2.hasNext(SortExec.scala:147) 	at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) 	at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoinExec.scala:950) 	at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.&amp;lt;init&amp;gt;(SortMergeJoinExec.scala:820) 	at org.apache.spark.sql.execution.joins.SortMergeJoinExec.$anonfun$doExecute$1(SortMergeJoinExec.scala:258) 	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:101) 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:356) 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:320) 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60) 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:356) 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:320) 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60) 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:356) 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:320) 	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) 	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) 	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144) 	at org.apache.spark.scheduler.Task.run(Task.scala:117) 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:655) 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581) 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:658) 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 	at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Connection reset by peer 	at sun.nio.ch.FileDispatcherImpl.read0(Native Method) 	at 
sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) 	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) 	at sun.nio.ch.IOUtil.read(IOUtil.java:192) 	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) 	at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253) 	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133) 	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350) 	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148) 	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) 	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650) 	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576) 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) 	... 1 more 
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2519)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2466)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2460)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2460)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:2050)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2718)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2668)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2656)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;B&gt;Note:&lt;/B&gt;&lt;I&gt; Also notice that the EBS Volume Type in the job-cluster is displayed as &lt;/I&gt;&lt;B&gt;&lt;I&gt;Autoscaling Local Storage&lt;/I&gt;&lt;/B&gt;&lt;I&gt; instead of &lt;/I&gt;&lt;B&gt;&lt;I&gt;General Purpose SSD&lt;/I&gt;&lt;/B&gt;&lt;I&gt;, even though I set it to General Purpose SSD.&lt;/I&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 28 Oct 2021 11:22:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/notebook-failing-in-job-cluster-but-runs-fine-in-all-purpose/m-p/12200#M7049</guid>
      <dc:creator>AjayHN</dc:creator>
      <dc:date>2021-10-28T11:22:49Z</dc:date>
    </item>
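    <!-- Editor's note: the FetchFailedException / "Connection reset by peer" above is a shuffle fetch
         failure, i.e. an executor could not pull shuffle blocks from another node, which usually points
         at lost or overloaded workers rather than at the notebook code itself. One common first step is
         to make shuffle fetches more tolerant by putting the settings in the job cluster's Spark config,
         so they are applied when the executors start. A minimal sketch as part of a Jobs API new_cluster
         spec, written as a Python dict; the runtime version, autoscale bounds and the timeout/retry
         values are illustrative assumptions, not tuned recommendations:

         new_cluster = {
             "spark_version": "9.1.x-scala2.12",            # assumed Databricks Runtime version
             "node_type_id": "i3.8xlarge",
             "autoscale": {"min_workers": 2, "max_workers": 8},
             "spark_conf": {
                 "spark.databricks.delta.optimizeWrite.enabled": "true",
                 "spark.databricks.adaptive.autoOptimizedShuffle.enabled": "true",
                 # tolerate slower or retried shuffle fetches before failing the stage
                 "spark.network.timeout": "600s",
                 "spark.shuffle.io.maxRetries": "10",
                 "spark.shuffle.io.retryWait": "30s",
             },
         }

         Because these are network/shuffle settings, they belong in the cluster definition rather than in
         a spark.conf.set() call inside the notebook, which would run after the executors are already up.
    -->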
    <item>
      <title>Re: Notebook failing in job-cluster but runs fine in all-purpose-cluster with the same configuration</title>
      <link>https://community.databricks.com/t5/data-engineering/notebook-failing-in-job-cluster-but-runs-fine-in-all-purpose/m-p/12202#M7051</link>
      <description>&lt;P&gt;Hi @Ajay Nanjundappa,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Check the "Event log" tab and search for any spot termination events. It looks like all of your nodes are spot instances, and the "FetchFailedException" error is typically associated with spot instance terminations.&lt;/P&gt;</description>
      <pubDate>Sat, 13 Nov 2021 00:48:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/notebook-failing-in-job-cluster-but-runs-fine-in-all-purpose/m-p/12202#M7051</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2021-11-13T00:48:13Z</dc:date>
    </item>
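    <!-- Editor's note: to confirm the spot-termination theory above, the cluster's event log can also be
         pulled programmatically via the Clusters API (POST /api/2.0/clusters/events) and scanned for
         node-loss events. A minimal sketch, assuming the `requests` library; the workspace URL, personal
         access token and cluster_id are placeholders:

         import requests

         HOST = "https://<your-workspace>.cloud.databricks.com"   # placeholder workspace URL
         TOKEN = "<personal-access-token>"                        # placeholder token
         CLUSTER_ID = "<job-cluster-id>"                          # placeholder job cluster id

         resp = requests.post(
             f"{HOST}/api/2.0/clusters/events",
             headers={"Authorization": f"Bearer {TOKEN}"},
             json={"cluster_id": CLUSTER_ID, "limit": 100},
         )
         resp.raise_for_status()
         for event in resp.json().get("events", []):
             # look for events indicating lost workers, e.g. spot instance reclamation
             print(event["timestamp"], event["type"], event.get("details", {}))

         If spot terminations do show up, the usual fix is to move the job cluster's aws_attributes to
         on-demand instances, or to spot with an on-demand fallback (the "first_on_demand" setting).
    -->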
  </channel>
</rss>

