The Python process exited with exit code 137 (SIGKILL: Killed). This may have been caused by an OOM error. Check your command's memory usage.
12-21-2022 06:47 PM
I am running a Hugging Face model on a GPU cluster (g4dn.xlarge, 16 GB memory, 4 cores). I run the same model in four different notebooks with different data sources, and I created a workflow to run the notebooks one after the other. The notebooks run fine individually, but in the workflow setup I get a fatal error: The Python kernel is unresponsive (The Python process exited with exit code 137 (SIGKILL: Killed). This may have been caused by an OOM error. Check your command's memory usage.).
12-21-2022 11:25 PM
It could be due to caching, which may hold on to some memory when you're reusing the cluster.
Try increasing your memory and/or optimizing your code a bit.
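If the workflow reuses the same cluster for all four notebooks, each run can leave the previous model and its cached tensors in memory. A minimal cleanup sketch, assuming a PyTorch-based Hugging Face model held in a variable named model (a placeholder, not necessarily the name used in the notebooks), that could be run at the end of each notebook:

import gc
import torch

# `model` is a stand-in for whatever Hugging Face model object the notebook created
try:
    del model
except NameError:
    pass

gc.collect()                   # free Python-side objects that still hold tensors
if torch.cuda.is_available():
    torch.cuda.empty_cache()   # return cached GPU memory to the CUDA driver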
01-04-2023 05:54 PM
I am not processing a big batch of data: just five text documents of roughly 1,000 characters each. I am using the GPU to run the transformer model, so the model itself is not really running on the CPU. That's why it is strange to get an OOM error with so little data being processed on the CPU.
12-27-2022 03:25 PM
You can check the executor logs to narrow down the error if you'd like, but technically this is an OOM, and increasing your cluster's resources will mitigate the issue.
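A hedged way to narrow it down without digging through the logs is to print memory usage at the start and end of each notebook; psutil reports host (driver) memory, which is usually what triggers exit code 137, and torch reports GPU memory. A rough sketch only:

import psutil
import torch

vm = psutil.virtual_memory()
print(f"host RAM used: {vm.used / 1e9:.1f} GB of {vm.total / 1e9:.1f} GB")

if torch.cuda.is_available():
    print(f"GPU allocated: {torch.cuda.memory_allocated() / 1e9:.1f} GB, "
          f"reserved: {torch.cuda.memory_reserved() / 1e9:.1f} GB")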
03-27-2023 02:12 AM
You might be accumulating gradients when running your Hugging Face model, which typically leads to out-of-memory errors after some iterations. If you use the model for inference only, wrap the call in torch.no_grad():

import torch

with torch.no_grad():
    # the code where you apply the model
    ...
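For context, a fuller inference sketch under torch.no_grad(); the checkpoint name and input text below are placeholders, not the ones from the original notebooks:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint only
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to("cuda").eval()

texts = ["short example document"]                 # stand-in for the five text files
inputs = tokenizer(texts, return_tensors="pt", truncation=True).to("cuda")

with torch.no_grad():                              # no gradient buffers are kept around
    logits = model(**inputs).logits

print(logits.argmax(dim=-1))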

