Model Serving Endpoint keeps failing with SIGKILL error

AChang
New Contributor III

I am trying to deploy a model in the serving endpoints section, but each attempt fails after the endpoint spends an hour trying to create. Here are the service logs:

Container failed with: 9 +0000] [115] [INFO] Booting worker with pid: 115
[2023-09-15 19:15:35 +0000] [2] [ERROR] Worker (pid:73) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:15:35 +0000] [119] [INFO] Booting worker with pid: 119
[2023-09-15 19:15:57 +0000] [2] [ERROR] Worker (pid:99) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:15:57 +0000] [131] [INFO] Booting worker with pid: 131
2023-09-15 19:16:05.631648: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-15 19:16:06.710808: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[2023-09-15 19:16:07 +0000] [2] [ERROR] Worker (pid:93) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:16:07 +0000] [137] [INFO] Booting worker with pid: 137
[2023-09-15 19:16:35 +0000] [2] [ERROR] Worker (pid:119) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:16:35 +0000] [155] [INFO] Booting worker with pid: 155
[2023-09-15 19:16:42 +0000] [2] [ERROR] Worker (pid:115) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:16:42 +0000] [159] [INFO] Booting worker with pid: 159
[2023-09-15 19:17:10 +0000] [2] [ERROR] Worker (pid:131) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:17:10 +0000] [175] [INFO] Booting worker with pid: 175
[2023-09-15 19:17:17 +0000] [2] [ERROR] Worker (pid:137) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:17:17 +0000] [179] [INFO] Booting worker with pid: 179
[2023-09-15 19:17:46 +0000] [2] [ERROR] Worker (pid:159) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:17:46 +0000] [195] [INFO] Booting worker with pid: 195

Should I try moving to the largest compute, or is the issue more to do with the model itself?

2 REPLIES

KAdamatzky
New Contributor III

Hi, did you find a solution to this? I am having the same problem.

Alberto_Umana
Databricks Employee

Hello @AChang,

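This error commonly occurs when your model needs more memory than your current compute provides. The repeated "was sent SIGKILL! Perhaps out of memory?" lines in your logs are consistent with workers being killed and restarted in a loop.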


  • Moving to a larger compute instance with more memory can help accommodate the memory requirements of your model. This is often the simplest solution if you have the resources available.

  • As the logs mention, setting the environment variable TF_ENABLE_ONEDNN_OPTS=0 disables oneDNN custom operations. The log message itself is about floating-point round-off rather than memory, so this might help only in some cases.

  • Check your code for memory leaks by monitoring memory usage over time and confirming that it does not grow continuously.
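
To make the last point concrete, here is a minimal sketch of one way to watch memory over repeated calls, using only Python's standard `tracemalloc` module. The helper names (`traced_memory_after_calls`, `looks_like_leak`) and the 1 MB threshold are hypothetical choices for illustration, not part of Databricks or MLflow. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory held natively by TensorFlow may not show up here; treat this as a first check on your own wrapper code rather than a full measurement.

```python
import tracemalloc
from typing import Callable, List


def traced_memory_after_calls(
    fn: Callable[[], object], calls: int = 50, sample_every: int = 10
) -> List[int]:
    """Call fn repeatedly; record Python-traced memory (bytes) every sample_every calls."""
    samples: List[int] = []
    tracemalloc.start()
    try:
        for i in range(1, calls + 1):
            fn()
            if i % sample_every == 0:
                current, _peak = tracemalloc.get_traced_memory()
                samples.append(current)
    finally:
        tracemalloc.stop()
    return samples


def looks_like_leak(samples: List[int], min_growth_bytes: int = 1_000_000) -> bool:
    """Heuristic: every sample is higher than the last AND total growth is large."""
    if len(samples) < 2:
        return False
    rising = all(later > earlier for earlier, later in zip(samples, samples[1:]))
    return rising and (samples[-1] - samples[0]) >= min_growth_bytes


if __name__ == "__main__":
    # Hypothetical stand-in for a predict function that keeps references around.
    cache = []
    samples = traced_memory_after_calls(lambda: cache.append(bytearray(100_000)))
    print("samples (bytes):", samples)
    print("possible leak:", looks_like_leak(samples))
```

In practice you would replace the lambda with a call to your own model's predict function on a fixed sample input and run it locally before deploying. If memory stays flat but the endpoint is still killed, the model is more likely just too large for the selected compute size.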

