Databricks Community

AChang · 09-15-2023

I am trying to deploy a model from the Serving endpoints section, but the deployment keeps failing after spending about an hour trying to create the endpoint. Here are the service logs:

Container failed with: 9 +0000] [115] [INFO] Booting worker with pid: 115
[2023-09-15 19:15:35 +0000] [2] [ERROR] Worker (pid:73) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:15:35 +0000] [119] [INFO] Booting worker with pid: 119
[2023-09-15 19:15:57 +0000] [2] [ERROR] Worker (pid:99) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:15:57 +0000] [131] [INFO] Booting worker with pid: 131
2023-09-15 19:16:05.631648: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-15 19:16:06.710808: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[2023-09-15 19:16:07 +0000] [2] [ERROR] Worker (pid:93) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:16:07 +0000] [137] [INFO] Booting worker with pid: 137
[2023-09-15 19:16:35 +0000] [2] [ERROR] Worker (pid:119) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:16:35 +0000] [155] [INFO] Booting worker with pid: 155
[2023-09-15 19:16:42 +0000] [2] [ERROR] Worker (pid:115) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:16:42 +0000] [159] [INFO] Booting worker with pid: 159
[2023-09-15 19:17:10 +0000] [2] [ERROR] Worker (pid:131) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:17:10 +0000] [175] [INFO] Booting worker with pid: 175
[2023-09-15 19:17:17 +0000] [2] [ERROR] Worker (pid:137) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:17:17 +0000] [179] [INFO] Booting worker with pid: 179
[2023-09-15 19:17:46 +0000] [2] [ERROR] Worker (pid:159) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:17:46 +0000] [195] [INFO] Booting worker with pid: 195

Should I try moving to the largest compute, or is the issue more to do with the model itself?
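
For anyone reading the logs above, one rough way to tell whether the workers die at startup or later under load is to pair each "Booting worker" line with the SIGKILL line for the same pid and see how long each worker lived. The sketch below is only an illustration in plain Python: the function name `worker_lifetimes` and the `SAMPLE` text are hypothetical, the sample lines are copied from the log in this post, and the regular expressions assume this exact log line layout, including the `+0000` offset.

```python
import re
from datetime import datetime

# Patterns for the two log line shapes seen in the service logs above.
# Assumption: timestamps always carry a +0000 offset, as in the post.
BOOT = re.compile(
    r"^\[(?P<ts>[\d-]+ [\d:]+) \+0000\] \[\d+\] \[INFO\] "
    r"Booting worker with pid: (?P<pid>\d+)"
)
KILL = re.compile(
    r"^\[(?P<ts>[\d-]+ [\d:]+) \+0000\] \[\d+\] \[ERROR\] "
    r"Worker \(pid:(?P<pid>\d+)\) was sent SIGKILL!"
)
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def worker_lifetimes(log_text):
    """Return {pid: seconds from boot to SIGKILL} for workers with both events."""
    booted = {}
    lifetimes = {}
    for raw in log_text.splitlines():
        line = raw.strip()
        m = BOOT.match(line)
        if m:
            booted[int(m["pid"])] = datetime.strptime(m["ts"], TS_FORMAT)
            continue
        m = KILL.match(line)
        if m:
            pid = int(m["pid"])
            if pid in booted:  # skip kills whose boot line is missing or truncated
                killed = datetime.strptime(m["ts"], TS_FORMAT)
                lifetimes[pid] = (killed - booted[pid]).total_seconds()
    return lifetimes


# Hypothetical sample: four lines copied from the log in this post.
SAMPLE = """
[2023-09-15 19:15:35 +0000] [119] [INFO] Booting worker with pid: 119
[2023-09-15 19:15:57 +0000] [131] [INFO] Booting worker with pid: 131
[2023-09-15 19:16:35 +0000] [2] [ERROR] Worker (pid:119) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:17:10 +0000] [2] [ERROR] Worker (pid:131) was sent SIGKILL! Perhaps out of memory?
"""

if __name__ == "__main__":
    for pid, seconds in worker_lifetimes(SAMPLE).items():
        print(f"worker {pid} lived {seconds:.0f}s before SIGKILL")
```

On the sample lines, workers 119 and 131 last 60 and 73 seconds. Lifetimes that short and that uniform would be consistent with each worker being killed while it is still loading the model rather than while serving requests, though the log alone does not prove that memory is the cause.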

Model Serving Endpoint keeps failing with SIGKILL error
