<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Is anyone else experiencing intermittent "Failure starting REPL" errors with PySpark Jobs? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/is-anyone-else-experiencing-intermittent-quot-failure-starting/m-p/30932#M22479</link>
    <description>&lt;P&gt;Hi Jordan. Thanks for the response! Annoying that there isn't an official answer. I have an open ticket with Microsoft, who are also looking into it for me. I will update here if I get anything concrete!&lt;/P&gt;</description>
    <pubDate>Mon, 21 Nov 2022 14:20:22 GMT</pubDate>
    <dc:creator>James_Cole</dc:creator>
    <dc:date>2022-11-21T14:20:22Z</dc:date>
    <item>
      <title>Is anyone else experiencing intermittent "Failure starting REPL" errors with PySpark Jobs?</title>
      <link>https://community.databricks.com/t5/data-engineering/is-anyone-else-experiencing-intermittent-quot-failure-starting/m-p/30926#M22473</link>
      <description>&lt;P&gt;I have a multi-task job that runs a bunch of PySpark notebooks, and about 30-60% of the time the job fails with the following error:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image.png"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1457i93BBA8DD40727C1E/image-size/large?v=v2&amp;amp;px=999" role="button" title="image.png" alt="image.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I haven't seen any consistency with this error. Sometimes every task in the job fails with it, sometimes only a single task does, and I've seen everything in between.&lt;/P&gt;&lt;P&gt;What's confusing the living daylights out of me is that this isn't an interactive cluster, so I'm not sure what the cause is. Any help would be appreciated.&lt;/P&gt;</description>
      <pubDate>Fri, 23 Sep 2022 22:16:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/is-anyone-else-experiencing-intermittent-quot-failure-starting/m-p/30926#M22473</guid>
      <dc:creator>JordanYaker</dc:creator>
      <dc:date>2022-09-23T22:16:51Z</dc:date>
    </item>
    <item>
      <title>Re: Is anyone else experiencing intermittent "Failure starting REPL" errors with PySpark Jobs?</title>
      <link>https://community.databricks.com/t5/data-engineering/is-anyone-else-experiencing-intermittent-quot-failure-starting/m-p/30927#M22474</link>
      <description>&lt;P&gt;Hi @Jordan Yaker, are you using DCS (Databricks Container Services)? Also, which runtime are you using?&lt;/P&gt;</description>
      <pubDate>Wed, 28 Sep 2022 16:07:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/is-anyone-else-experiencing-intermittent-quot-failure-starting/m-p/30927#M22474</guid>
      <dc:creator>User16741082858</dc:creator>
      <dc:date>2022-09-28T16:07:14Z</dc:date>
    </item>
    <item>
      <title>Re: Is anyone else experiencing intermittent "Failure starting REPL" errors with PySpark Jobs?</title>
      <link>https://community.databricks.com/t5/data-engineering/is-anyone-else-experiencing-intermittent-quot-failure-starting/m-p/30928#M22475</link>
      <description>&lt;P&gt;@Pearl Ubaru I'm not using DCS, and I was on 11.3. My account rep talked to some people internally and suggested rolling back to 10.4. I ended up doing that, and the problem seems to have gone away. Unfortunately, this leaves me without the ability to use the `availableNow` trigger, but I'd rather have a stable system than that trigger.&lt;/P&gt;</description>
      <pubDate>Wed, 28 Sep 2022 18:07:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/is-anyone-else-experiencing-intermittent-quot-failure-starting/m-p/30928#M22475</guid>
      <dc:creator>JordanYaker</dc:creator>
      <dc:date>2022-09-28T18:07:58Z</dc:date>
    </item>
    <item>
      <title>Re: Is anyone else experiencing intermittent "Failure starting REPL" errors with PySpark Jobs?</title>
      <link>https://community.databricks.com/t5/data-engineering/is-anyone-else-experiencing-intermittent-quot-failure-starting/m-p/30929#M22476</link>
      <description>&lt;P&gt;I was going to assume it had something to do with the runtime. Please bear with us as we work to improve on our end. I am glad this workaround is working for now.&lt;/P&gt;</description>
      <pubDate>Wed, 28 Sep 2022 18:34:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/is-anyone-else-experiencing-intermittent-quot-failure-starting/m-p/30929#M22476</guid>
      <dc:creator>User16741082858</dc:creator>
      <dc:date>2022-09-28T18:34:20Z</dc:date>
    </item>
    <item>
      <title>Re: Is anyone else experiencing intermittent "Failure starting REPL" errors with PySpark Jobs?</title>
      <link>https://community.databricks.com/t5/data-engineering/is-anyone-else-experiencing-intermittent-quot-failure-starting/m-p/30930#M22477</link>
      <description>&lt;P&gt;Hi. Did you ever get a resolution to this problem other than rolling back to 10.4? I recently moved some workloads over to runtime 11.3 and am experiencing intermittent "repl did not start in 30 seconds." errors.&lt;/P&gt;&lt;P&gt;On Microsoft's advice, I have increased the REPL timeout to 150 seconds, but this hasn't fixed the issue. They have also suggested increasing the size of the cluster, but this doesn't feel like the right solution.&lt;/P&gt;</description>
      <pubDate>Mon, 21 Nov 2022 12:05:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/is-anyone-else-experiencing-intermittent-quot-failure-starting/m-p/30930#M22477</guid>
      <dc:creator>James_Cole</dc:creator>
      <dc:date>2022-11-21T12:05:48Z</dc:date>
    </item>
    <item>
      <title>Re: Is anyone else experiencing intermittent "Failure starting REPL" errors with PySpark Jobs?</title>
      <link>https://community.databricks.com/t5/data-engineering/is-anyone-else-experiencing-intermittent-quot-failure-starting/m-p/30931#M22478</link>
      <description>&lt;P&gt;I did not.  11.3 still seems to have stability issues despite it being the next LTS. I still get the REPL errors along with "The Python kernel is unresponsive."  It's really annoying.&lt;/P&gt;</description>
      <pubDate>Mon, 21 Nov 2022 12:57:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/is-anyone-else-experiencing-intermittent-quot-failure-starting/m-p/30931#M22478</guid>
      <dc:creator>JordanYaker</dc:creator>
      <dc:date>2022-11-21T12:57:43Z</dc:date>
    </item>
    <item>
      <title>Re: Is anyone else experiencing intermittent "Failure starting REPL" errors with PySpark Jobs?</title>
      <link>https://community.databricks.com/t5/data-engineering/is-anyone-else-experiencing-intermittent-quot-failure-starting/m-p/30932#M22479</link>
      <description>&lt;P&gt;Hi Jordan. Thanks for the response! Annoying that there isn't an official answer. I have an open ticket with Microsoft, who are also looking into it for me. I will update here if I get anything concrete!&lt;/P&gt;</description>
      <pubDate>Mon, 21 Nov 2022 14:20:22 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/is-anyone-else-experiencing-intermittent-quot-failure-starting/m-p/30932#M22479</guid>
      <dc:creator>James_Cole</dc:creator>
      <dc:date>2022-11-21T14:20:22Z</dc:date>
    </item>
    <item>
      <title>Re: Is anyone else experiencing intermittent "Failure starting REPL" errors with PySpark Jobs?</title>
      <link>https://community.databricks.com/t5/data-engineering/is-anyone-else-experiencing-intermittent-quot-failure-starting/m-p/30933#M22480</link>
      <description>&lt;P&gt;I had the following update from Databricks support:&lt;/P&gt;&lt;P&gt;"We can see the below error just before the repls started failing -&lt;/P&gt;&lt;P&gt; 22/11/17 05:32:07 ERROR WSFSDriverManager$: Failed to get associated pid for WSFS&lt;/P&gt;&lt;P&gt; In the driver logs we could see several repls being initialized during that time. Going through similar scenarios with other customers in our backlogs we have seen reducing the concurrency helps mitigate the problem. Increasing the driver size will help as well since it will provide more cores for concurrent execution."&lt;/P&gt;&lt;P&gt;I'm still not convinced this gets to the root of the problem, as everything seems stable now that we have rolled the clusters back to 10.4...&lt;/P&gt;</description>
      <pubDate>Thu, 24 Nov 2022 09:53:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/is-anyone-else-experiencing-intermittent-quot-failure-starting/m-p/30933#M22480</guid>
      <dc:creator>James_Cole</dc:creator>
      <dc:date>2022-11-24T09:53:50Z</dc:date>
    </item>
  </channel>
</rss>

