Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

Model Serving Endpoint keeps failing with SIGKILL error

AChang
New Contributor III

I am trying to deploy a model in the Serving endpoints section, but the deployment keeps failing after attempting to create for an hour. Here are the service logs:

Container failed with: 9 +0000] [115] [INFO] Booting worker with pid: 115
[2023-09-15 19:15:35 +0000] [2] [ERROR] Worker (pid:73) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:15:35 +0000] [119] [INFO] Booting worker with pid: 119
[2023-09-15 19:15:57 +0000] [2] [ERROR] Worker (pid:99) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:15:57 +0000] [131] [INFO] Booting worker with pid: 131
2023-09-15 19:16:05.631648: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-15 19:16:06.710808: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[2023-09-15 19:16:07 +0000] [2] [ERROR] Worker (pid:93) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:16:07 +0000] [137] [INFO] Booting worker with pid: 137
[2023-09-15 19:16:35 +0000] [2] [ERROR] Worker (pid:119) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:16:35 +0000] [155] [INFO] Booting worker with pid: 155
[2023-09-15 19:16:42 +0000] [2] [ERROR] Worker (pid:115) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:16:42 +0000] [159] [INFO] Booting worker with pid: 159
[2023-09-15 19:17:10 +0000] [2] [ERROR] Worker (pid:131) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:17:10 +0000] [175] [INFO] Booting worker with pid: 175
[2023-09-15 19:17:17 +0000] [2] [ERROR] Worker (pid:137) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:17:17 +0000] [179] [INFO] Booting worker with pid: 179
[2023-09-15 19:17:46 +0000] [2] [ERROR] Worker (pid:159) was sent SIGKILL! Perhaps out of memory?
[2023-09-15 19:17:46 +0000] [195] [INFO] Booting worker with pid: 195

Should I try moving to the largest compute size, or is the issue more likely with the model itself?

1 ACCEPTED SOLUTION

Kaniz_Fatma
Community Manager

Hi @AChang, based on the logs, your workers are being terminated because they are running out of memory: the repeated "Worker (pid:X) was sent SIGKILL! Perhaps out of memory?" messages mean the operating system is killing each gunicorn worker as it exceeds the available memory. This suggests the model you are trying to deploy is too large or too memory-hungry for the current allocation.

By default, Databricks Model Serving provides 4 GB of memory for your model. If your model requires more, you can reach out to your Databricks support contact to increase this limit to up to 16 GB per model.

Before moving to the largest compute, you might want to consider the following steps:

1. Try optimizing your model. This could involve simplifying the model architecture, reducing the dimensionality of your data, or using a more memory-efficient data representation (for example, lower-precision weights).

2. Monitor the memory usage of your model during training and inference to get a concrete measure of how much memory it actually requires.

3. If your model is indeed too large for the current memory allocation, request a memory-limit increase from Databricks support.

Remember that moving to a larger compute resource may incur additional costs, so it's important to ensure that this is necessary before making the change.
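For step 2, a quick way to estimate how much memory the model needs is to compare the process's peak resident set size before and after loading it. Below is a minimal sketch using Python's standard resource module (Unix only); the MLflow load call and model URI are hypothetical placeholders, stood in for here by a throwaway allocation:

```python
import resource
import sys

def peak_rss_mb() -> float:
    """Peak resident set size of this process, in MB.

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / (1024 * 1024) if sys.platform == "darwin" else rss / 1024

before = peak_rss_mb()

# In a real check you would load the served model here, e.g. (hypothetical URI):
#   import mlflow
#   model = mlflow.pyfunc.load_model("models:/my-model/1")
# A throwaway allocation stands in for the model load in this sketch:
blob = [0.0] * 2_000_000

after = peak_rss_mb()
print(f"peak RSS grew by roughly {after - before:.1f} MB")
```

If the measured growth (plus headroom for inference-time activations and the serving framework itself) exceeds the 4 GB default, that confirms the memory limit, not the compute size, is the constraint.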


