
Is anyone else experiencing intermittent "Failure starting REPL" errors with PySpark Jobs?

JordanYaker
Contributor

I have a Multi-Task Job that is running a bunch of PySpark notebooks and about 30-60% of the time, my jobs fail with the following error:

[image.png: screenshot of the error]

I haven't seen any consistency with this error. I've seen every task in the job fail with it, only a single task fail, and everything in between.

What's confusing the living daylights out of me is that this isn't an interactive cluster so I'm not sure what the cause is. Any help would be appreciated.


7 REPLIES

User16741082858
Contributor III

Hi @Jordan Yaker, are you using DCS (Databricks Container Services)? Also, what runtime are you using?

JordanYaker
Contributor

@Pearl Ubaru I'm not using DCS, and I was using 11.3. My account rep talked to some people internally and suggested rolling back to 10.4. I ended up doing that and the problem seems to have gone away. Unfortunately, this leaves me without the ability to use the `availableNow` trigger, but I'd rather have a stable system than that trigger.
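For anyone else weighing the same trade-off, here is a minimal sketch of falling back to a one-shot trigger when `availableNow` is not available. It assumes the open-source PySpark behaviour, where `DataStreamWriter.trigger(availableNow=True)` was added in 3.3.0 and `trigger(once=True)` exists in earlier versions; Databricks runtimes may map to this differently, so treat the version check as an assumption. The helper name `trigger_kwargs` is hypothetical.

```python
def trigger_kwargs(spark_version: str) -> dict:
    """Pick keyword arguments for DataStreamWriter.trigger().

    Assumption: availableNow is usable from Spark 3.3 onward; older
    versions fall back to the one-shot `once` trigger.
    """
    # Compare only the major and minor components, e.g. "3.2.1" -> (3, 2).
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    if (major, minor) >= (3, 3):
        return {"availableNow": True}
    return {"once": True}


# Intended use inside a notebook (not executed here):
#   df.writeStream.trigger(**trigger_kwargs(spark.version)).start(...)
```

Note that `once=True` processes everything available in a single batch, whereas `availableNow=True` can split the backlog across several batches, so the fallback is not a perfect substitute for large backlogs.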

I was going to assume it has something to do with the runtime. Please bear with us as we work to improve on our end. I am glad this workaround is effective for now.

James_Cole
New Contributor III

Hi. Did you ever get a resolution to this problem other than rolling back to 10.4? I have recently moved some workloads over to runtime 11.3 and am experiencing intermittent "repl did not start in 30 seconds." errors.

I have increased the REPL timeout to 150 seconds, per Microsoft's advice, but this hasn't fixed the issue. They have also suggested increasing the size of the cluster, but this doesn't feel like the right solution.

I did not. 11.3 still seems to have stability issues despite it being the next LTS. I still get the REPL errors along with "The Python kernel is unresponsive." It's really annoying.

Hi Jordan. Thanks for the response! It's annoying that there isn't an official answer. I have an open ticket with Microsoft, who are also looking into it for me. I will update here if I get anything concrete!

I had the following update from Databricks support:

"We can see the below error just before the repls started failing -

22/11/17 05:32:07 ERROR WSFSDriverManager$: Failed to get associated pid for WSFS

In the driver logs we could see several repls being initialized during that time. Going through similar scenarios with other customers in our backlogs we have seen reducing the concurrency helps mitigate the problem. Increasing the driver size will help as well since it will provide more cores for concurrent execution."

I'm still not convinced this gets to the root of the problem, as everything seems stable now that we have rolled the clusters back to 10.4...
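If you want to try the "reduce the concurrency" suggestion from support when launching notebooks from a driver notebook, a rough sketch is below. It caps how many tasks run at once and retries a task that fails. The names `run_with_retry` and `run_bounded` are hypothetical, and each task is just a Python callable; in a Databricks notebook it could wrap something like `dbutils.notebook.run(...)`, which is an assumption about your setup. For a Multi-Task Job, the rough equivalent would be chaining tasks with dependencies so fewer start at the same time.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(task, retries=2, backoff_seconds=0.0):
    """Call task(); on any exception, retry up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise  # out of attempts, surface the last error
            # Wait a little longer after each failed attempt.
            time.sleep(backoff_seconds * (attempt + 1))


def run_bounded(tasks, max_concurrent=2, retries=2, backoff_seconds=0.0):
    """Run callables with at most `max_concurrent` in flight.

    Results are returned in the same order as `tasks`.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = [
            pool.submit(run_with_retry, task, retries, backoff_seconds)
            for task in tasks
        ]
        return [future.result() for future in futures]
```

Retries only mask the intermittent failure rather than explain it, so this is a mitigation in the same spirit as the support advice, not a root-cause fix.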
