Training Job Failure (Driver Error)
10-30-2024 05:50 PM
We have a new model training job that was running fine for a few days and then started failing. I have attached images for more details.
I am wondering whether the 'can't reach driver cluster' message is a red herring, since the driver is reported as healthy right before execution.
When I look into the logs, it looks like a library problem, potentially with numpy:
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
Traceback (most recent call last):
from pandas._libs.interval import Interval
File "pandas/_libs/interval.pyx", line 1, in init pandas._libs.interval
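This kind of error generally means a compiled extension (here pandas) was built against a different numpy than the one installed at runtime. A minimal sketch of a check that could be run in a notebook cell on the same cluster is below. It reads the installed versions from package metadata, so it does not need to import pandas (which is the import that fails). The helper names `installed_versions` and `major_version` are made up for illustration, and comparing major versions is only a rough heuristic, not a definitive compatibility test.

```python
from importlib import metadata


def installed_versions(packages=("numpy", "pandas")):
    """Return {package: version string or None} without importing the packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            versions[name] = None
    return versions


def major_version(version):
    """Return the leading integer of a version string, or None if missing."""
    return int(version.split(".")[0]) if version else None


if __name__ == "__main__":
    found = installed_versions()
    for name, version in found.items():
        print(f"{name}: {version}")
    # A change in the numpy major version between working and failing runs
    # would be one thing to look for (assumption: an unpinned dependency
    # pulled in a newer or older numpy than pandas was built against).
    print("numpy major version:", major_version(found["numpy"]))
```

If the numpy version differs from what the job used while it was still working, pinning numpy and pandas to a known-compatible pair in the job's library configuration would be one thing to try.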
Has anyone seen this before, and does anyone have ideas or suggestions?