Machine Learning

DBRX - Serving endpoint failed - update timed out.

quintrix
New Contributor II

Hi,

https://notebooks.databricks.com/demos/llm-rag-chatbot/index.html

Following this tutorial, I'm trying to serve an endpoint with a DBRX model connected to my data in a vector DB.
I can log my model in Databricks with MLflow and call it locally from notebooks without any problem, but when I try to serve the endpoint, it fails after about 35-40 minutes with this message:

OperationFailed: failed to reach NOT_UPDATING, got EndpointStateConfigUpdate.UPDATE_FAILED: current status: EndpointStateConfigUpdate.UPDATE_FAILED

In the create_and_wait() call I set the timeout parameter to two hours, so that the method doesn't stop after the default 20 minutes:

w.serving_endpoints.create_and_wait(name=serving_endpoint_name, config=endpoint_config, timeout=timedelta(hours=2))
The timeout value is applied correctly, so some other issue must be causing the timeout error.
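For reference, a minimal sketch of how the failure could be inspected from code rather than from the Serving tab. It assumes `w` is a `databricks.sdk.WorkspaceClient` and uses its `serving_endpoints.get` and `serving_endpoints.build_logs` calls. The helper name `create_endpoint_or_report` is hypothetical, and the attribute names read from the endpoint (`pending_config`, `config`, `served_entities`) are an assumption that may differ between SDK versions.

```python
from datetime import timedelta


def create_endpoint_or_report(w, name, config, timeout=timedelta(hours=2)):
    """Create the endpoint; if that fails, collect its state and build logs.

    `w` is expected to behave like databricks.sdk.WorkspaceClient (assumption).
    """
    try:
        w.serving_endpoints.create_and_wait(name=name, config=config, timeout=timeout)
        return {"ok": True}
    except Exception as exc:  # e.g. OperationFailed on UPDATE_FAILED, or a timeout
        report = {"ok": False, "error": str(exc), "build_logs": {}}
        endpoint = w.serving_endpoints.get(name)
        report["state"] = str(endpoint.state)
        # A failed update may only be visible in the pending config (assumption).
        cfg = endpoint.pending_config or endpoint.config
        for entity in getattr(cfg, "served_entities", None) or []:
            # Build logs cover the container build, where conda errors would appear.
            logs = w.serving_endpoints.build_logs(name, entity.name)
            report["build_logs"][entity.name] = logs.logs
        return report


# Usage (requires the databricks-sdk package and workspace credentials):
#   from databricks.sdk import WorkspaceClient
#   report = create_endpoint_or_report(WorkspaceClient(), serving_endpoint_name, endpoint_config)
#   print(report)
```

Returning the build logs alongside the error keeps the conda output and the `UPDATE_FAILED` status in one place, which can make it easier to tell a failed container build apart from a slow one.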

Screenshots from Serving tab in Databricks:
quintrix_0-1717395746345.png

In the service logs I can also see some exceptions raised by conda:

quintrix_1-1717395859283.png

Any idea how to solve the issue? 

1 REPLY

Thank you for the answer.

  • The model is located in Unity Catalog like so:
    quintrix_1-1717485792659.png
  • The model isn't deployed yet so can't check health metrics.
  • I don't use Azure DevOps.
  • I've implemented 5 retries (the first run creates the endpoint, the following ones try to update it), but they all produce the same error. Each attempt seems to fail after a similar period of time:
    quintrix_2-1717496898846.png
  • If I understand correctly, model serving does not run on my cluster, where I could set environment variables. Please correct me if I'm wrong. I can create the endpoint from the UI with my clusters turned off, and none of them are running at this time:
    quintrix_3-1717498078275.png
    So where can I set the variable?
    I've set the variable in code before calling create_and_wait(), but I'm not sure that's correct.
  • Any other ideas?
  • What about the conda exceptions during deployment? How can I debug them?
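On the question of where the variable goes: Model Serving lets environment variables be declared on the served entity itself, through the `environment_vars` field of the endpoint config, rather than on a cluster. A minimal sketch of the corresponding request payload follows. The function name, the model name, the variable name `MY_VARIABLE`, the secret scope and key, and the workload size are all hypothetical placeholders, not values from this thread.

```python
def served_entity_payload(entity_name, entity_version, env_vars):
    """Build one `served_entities` item for a serving endpoint config.

    The keys mirror the fields of the SDK's ServedEntityInput; the values
    chosen for workload size and scale-to-zero are illustrative assumptions.
    """
    return {
        "entity_name": entity_name,            # Unity Catalog model name
        "entity_version": str(entity_version),
        "workload_size": "Small",              # assumption, adjust as needed
        "scale_to_zero_enabled": True,         # assumption, adjust as needed
        # Available inside the serving container, not on any cluster.
        "environment_vars": dict(env_vars),
    }


payload = served_entity_payload(
    "main.default.my_model",                   # hypothetical model name
    1,
    # Hypothetical variable; the {{secrets/<scope>/<key>}} form references a secret.
    {"MY_VARIABLE": "{{secrets/my_scope/my_key}}"},
)
```

With the Python SDK, the same fields would be passed as keyword arguments to `ServedEntityInput` inside `EndpointCoreConfigInput(served_entities=[...])`. A variable set with `os.environ` in the notebook before calling create_and_wait() only exists in the notebook process and is not carried over to the endpoint.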

As a test, I also served a simple linear regression model. That endpoint was created successfully and works fine.
