Training Job Failure (Driver Error)

jonathanhodges
New Contributor II

We have a new model training job that was running fine for a few days and then started failing. I have attached images for more details.

I am wondering whether the 'can't reach driver cluster' message is a red herring, since it says the driver is healthy right before execution.

When I look into the logs, it looks like a library problem, potentially with numpy:

ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
Traceback (most recent call last):
from pandas._libs.interval import Interval
File "pandas/_libs/interval.pyx", line 1, in init pandas._libs.interval

Has anyone seen this before, or does anyone have ideas or suggestions?
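
In case it helps with diagnosis: a `numpy.dtype size changed` error usually suggests that a compiled extension (here, pandas) was built against a different NumPy version than the one installed at runtime. Below is a minimal sketch for checking what is installed on the cluster without importing pandas itself, since that import is what fails. `installed_versions` is a hypothetical helper name, and the package list is an assumption; add whatever else the job depends on.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("numpy", "pandas")):
    """Return {package: version string, or None if not installed}.

    Reads package metadata only, so it works even when importing
    the package itself raises the binary-incompatibility error.
    """
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Print one line per package so the output can be compared
    # between a run that worked and a run that failed.
    for name, ver in installed_versions().items():
        print(f"{name}: {ver if ver is not None else 'not installed'}")
```

If the versions differ from those in the runs that succeeded, an unpinned dependency that upgraded numpy or pandas between runs would be one possible explanation, though that is a guess based on the traceback alone.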