
The Python process exited with exit code 137 (SIGKILL: Killed). This may have been caused by an OOM error. Check your command's memory usage.

Koliya
New Contributor II

I am running a Hugging Face model on a GPU cluster (g4dn.xlarge, 16 GB memory, 4 cores). I run the same model in four different notebooks, each with a different data source, and I created a workflow that runs the notebooks one after the other. Each notebook runs fine on its own, but in the workflow I get a fatal error: The Python kernel is unresponsive (The Python process exited with exit code 137 (SIGKILL: Killed). This may have been caused by an OOM error. Check your command's memory usage.).

5 REPLIES

daniel_sahal
Esteemed Contributor

It could be due to caching, which can hold on to some memory when you're reusing the cluster.

Try increasing the cluster's memory and/or optimizing your code a little.
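
As a rough sketch of one such optimization: if objects from an earlier notebook are still referenced in the same Python process (an assumption; it depends on how the workflow tasks share the cluster), releasing them explicitly before the notebook finishes may help. The helper name release_memory is hypothetical, and the torch import is optional so the snippet also runs where PyTorch is not installed.

```python
import gc

try:
    import torch  # optional: only needed to release cached GPU memory
except ImportError:
    torch = None


def release_memory():
    """Best-effort cleanup to run at the end of a notebook (hypothetical helper)."""
    # Collect unreachable Python objects; call this after `del model`, etc.
    collected = gc.collect()
    # Ask PyTorch's caching allocator to release unused cached GPU memory.
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
    return collected
```

A typical call would be `del model` followed by `release_memory()` in the last cell. Note that exit code 137 points at host RAM rather than GPU memory, so the `gc.collect()` part is the more relevant one here; `torch.cuda.empty_cache()` only affects GPU memory.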

Koliya
New Contributor II

I am not processing a big batch of data. It's just five text documents of fewer than 1,000 characters each, approximately. I am utilising the GPU to run the transformer model, so the model itself is not really running on the CPU. That's why it is weird to get an OOM error when so little data is being processed on the CPU.

Kaniz
Community Manager

Hi @Koliya Wedanage, we haven't heard from you since the last response from @Daniel Sahal, and I was checking back to see if their suggestions helped you.

If you have found a solution yourself, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.

jose_gonzalez
Moderator

You can check the executor logs to narrow down the error if you like, but technically this is an OOM, and increasing your cluster's resources will mitigate the issue.
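
To make the logs more useful for narrowing this down, one option is to print the process's peak host memory at a few points in each notebook. The following is a minimal sketch using only the Python standard library; the function names peak_rss_mib and log_memory are hypothetical, and it assumes a Unix-like driver node.

```python
import resource
import sys


def peak_rss_mib():
    """Peak resident set size of the current process, in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(label):
    """Print the peak host memory so it shows up in the notebook output."""
    message = f"[{label}] peak host memory: {peak_rss_mib():.0f} MiB"
    print(message)
    return message
```

Calling `log_memory("after model load")` and `log_memory("after inference")` in each notebook would show which step pushes host memory towards the 16 GB limit.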

fkemeth
New Contributor II

You might be accumulating gradients when running your Hugging Face model, which typically leads to out-of-memory errors after some iterations. If you use it for inference only, wrap the model call in torch.no_grad():

import torch

with torch.no_grad():  # disable gradient tracking during inference
    outputs = model(**inputs)  # placeholder: the code where you apply the model
