Is anyone else experiencing intermittent "Failure starting REPL" errors with PySpark Jobs?

JordanYaker
Contributor

I have a Multi-Task Job that is running a bunch of PySpark notebooks and about 30-60% of the time, my jobs fail with the following error:

[screenshot of the "Failure starting REPL" error]

I haven't seen any consistency with this error. I've had as many as all of the tasks in the job giving this error, as few as a single task throwing it, and everything in between.

What's confusing the living daylights out of me is that this isn't an interactive cluster so I'm not sure what the cause is. Any help would be appreciated.

7 REPLIES

User16741082858
Contributor III

Hi @Jordan Yaker, are you using DCS (Databricks Container Services)? Also, what runtime are you using?

JordanYaker
Contributor

@Pearl Ubaru I'm not using DCS, and I was on 11.3. My account rep talked to some people internally and suggested rolling back to 10.4. I did that, and the problem seems to have gone away. Unfortunately, this leaves me without the `availableNow` trigger, but I'd rather have a stable system than that trigger.

I would assume it has something to do with the runtime. Please bear with us as we work to improve on our end. I'm glad this workaround is effective for now.

James_Cole
New Contributor III

Hi. Did you ever get a resolution to this problem other than rolling back to 10.4? I recently moved some workloads over to runtime 11.3 and am experiencing intermittent "repl did not start in 30 seconds." errors.

I have increased the REPL timeout to 150 seconds, per Microsoft's advice, but this hasn't fixed the issue. They have also suggested increasing the size of the cluster, but that doesn't feel like the right solution.

I did not. 11.3 still seems to have stability issues despite it being the next LTS. I still get the REPL errors along with "The Python kernel is unresponsive." It's really annoying.

Hi Jordan. Thanks for the response! Annoying that there isn't an official answer. I have an open ticket with Microsoft who are also looking into it for me, I will update here if I get anything concrete!

Had the following update from Databricks support.

"We can see the below error just before the repls started failing -

22/11/17 05:32:07 ERROR WSFSDriverManager$: Failed to get associated pid for WSFS

In the driver logs we could see several repls being initialized during that time. Going through similar scenarios with other customers in our backlogs we have seen reducing the concurrency helps mitigate the problem. Increasing the driver size will help as well since it will provide more cores for concurrent execution."

Still not convinced this gets to the root of the problem, as everything seems stable now that we have rolled clusters back to 10.4...
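For anyone who wants to try the concurrency suggestion from support without shrinking their job: one way to reduce how many REPLs initialize at once is to chain tasks with `depends_on` in the Jobs API (2.1) payload, so tasks run sequentially instead of in parallel. This is just a sketch; the job name, task keys, paths, and cluster ID below are all made up:

```json
{
  "name": "example-job",
  "tasks": [
    {
      "task_key": "etl_a",
      "notebook_task": { "notebook_path": "/Repos/example/etl_a" },
      "existing_cluster_id": "1234-567890-abcdefgh"
    },
    {
      "task_key": "etl_b",
      "depends_on": [ { "task_key": "etl_a" } ],
      "notebook_task": { "notebook_path": "/Repos/example/etl_b" },
      "existing_cluster_id": "1234-567890-abcdefgh"
    }
  ]
}
```

With the `depends_on` edge, `etl_b` won't start its REPL until `etl_a` finishes, so the driver initializes fewer REPLs concurrently, which is what support said helped other customers.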
