Machine Learning

DBRX - Serving endpoint failed - update timed out.

New Contributor II


Following this tutorial, I'm trying to serve an endpoint with a DBRX model connected to my data in a vector database.
I can log the model in Databricks with MLflow and call it locally from notebooks without any problem, but when I try to serve the endpoint, it fails after about 35-40 minutes with this message:

OperationFailed: failed to reach NOT_UPDATING, got EndpointStateConfigUpdate.UPDATE_FAILED: current status: EndpointStateConfigUpdate.UPDATE_FAILED

In the create_and_wait() call I set the timeout parameter to two hours, so that the method doesn't give up after the default 20 minutes:

w.serving_endpoints.create_and_wait(name=serving_endpoint_name, config=endpoint_config, timeout=timedelta(hours=2))
The longer timeout takes effect, so something else must be causing the update to fail.
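
When create_and_wait() raises, the exception text only reports the final state. Below is a minimal sketch of catching the failure and printing the endpoint state together with the container build logs, which is where conda errors would normally be expected to appear. It assumes the Databricks SDK calls serving_endpoints.get() and serving_endpoints.build_logs() are available in the installed SDK version. The helper name create_and_report and its served_model_name argument are hypothetical and not part of the original post.

```python
from datetime import timedelta


def create_and_report(w, name, config, served_model_name, timeout=timedelta(hours=2)):
    """Create a serving endpoint; on failure, print its state and build logs.

    `w` is assumed to be a databricks.sdk.WorkspaceClient.
    `served_model_name` is the name of the served model inside the endpoint
    config (hypothetical parameter of this helper).
    """
    try:
        return w.serving_endpoints.create_and_wait(
            name=name, config=config, timeout=timeout
        )
    except Exception as exc:  # broad on purpose: OperationFailed, TimeoutError, ...
        print(f"Endpoint update failed: {exc}")
        # Current endpoint state, including the config update status.
        endpoint = w.serving_endpoints.get(name)
        print(f"Endpoint state: {endpoint.state}")
        # Build logs cover the container build, including environment setup.
        build = w.serving_endpoints.build_logs(name, served_model_name)
        print(build.logs)
        raise  # re-raise so the caller still sees the original failure
```

The helper re-raises the original exception, so existing error handling around the call keeps working; it only adds diagnostics.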

Screenshots from Serving tab in Databricks:

In the service logs I can also see some exceptions raised by conda:


Any idea how to solve the issue? 


Community Manager

Hi @quintrix

  • Verify that your model artifacts are correctly located. Ensure that the model files are accessible and in the expected location.
  • If your model has any initialization scripts or package dependencies, make sure they are set up correctly.
  • Keep an eye on endpoint health metrics such as QPS (queries per second), latency, and error rates. These can help diagnose any anomalies.
  • If you notice any unusual patterns, investigate further to identify the root cause.
  • As an immediate workaround, consider adding retries around the model download logic. You can add a couple of retries with a short sleep interval (e.g., 1 second) between retries.
  • Transient network issues might be causing the timeout, and retries can help mitigate this.
  • If you’re using Azure DevOps Pipelines, ensure that proper firewall rules are set up for your Azure database.
  • Incorrect firewall rules could prevent successful communication between your Databricks cluster and the database.
  • MLflow has a parameter called MLFLOW_SCORING_SERVER_REQUEST_TIMEOUT that controls the timeout for model scoring server requests.
  • You can set this environment variable in your deployment environment. 
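
The retry suggestion above can be sketched as a small generic helper. This is an illustration only: the name retry and its parameters are hypothetical, and the thread does not show the actual model download code it would wrap. Retries of this kind only help with transient errors; a failure that is deterministic will repeat on every attempt.

```python
import time


def retry(operation, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call `operation()` up to `attempts` times, sleeping between failures.

    Returns the first successful result, or re-raises the last exception
    once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            last_error = exc
            # Sleep only if another attempt will follow.
            if attempt < attempts:
                sleep(delay_seconds)
    raise last_error
```

The sleep function is a parameter so the helper can be exercised in tests without actually waiting.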

Good luck with resolving the issue! 

Thank you for the answer.

  • The model is located in Unity Catalog like so:
  • The model isn't deployed yet, so I can't check health metrics.
  • I don't use Azure DevOps.
  • I've implemented 5 retries (the first run creates the endpoint, the next ones try to update it), but all of them produce the same error. Each attempt seems to fail after a similar period of time:
  • If I understand correctly, model serving does not run on my cluster, where I could set environment variables. Please correct me if I'm wrong. I can create the endpoint from the UI with the cluster off, and none of my clusters are running at this time:
    So where can I set the variable?
    I've set the variable in code before calling create_and_wait(), but I'm not sure that's correct.
  • Any other ideas?
  • What about the conda exceptions during deployment? How can I debug them?
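
On the question of where the variable goes: a variable set in notebook code only affects the notebook's own process, not the serving container. As far as I know, the serving endpoints REST API accepts an environment_vars map on each served entity, so one option is to put the variable into the endpoint configuration itself. The sketch below builds such a request body as a plain dictionary. The endpoint name, model name, version, workload size, and the timeout value "300" are all placeholders, and whether this variable has any effect on the UPDATE_FAILED error is not established in this thread.

```python
import json


def build_endpoint_payload(endpoint_name, entity_name, entity_version, env_vars):
    """Build a request body for creating a serving endpoint.

    The field names follow the serving endpoints REST API as I understand it
    (an assumption; check them against the current API reference).
    """
    return {
        "name": endpoint_name,
        "config": {
            "served_entities": [
                {
                    "entity_name": entity_name,  # Unity Catalog model name
                    "entity_version": str(entity_version),
                    "workload_size": "Small",  # placeholder size
                    "scale_to_zero_enabled": True,
                    # Environment variables for the serving container.
                    "environment_vars": dict(env_vars),
                }
            ]
        },
    }


# Placeholder names and values, for illustration only.
payload = build_endpoint_payload(
    "dbrx-rag-endpoint",
    "main.default.dbrx_rag",
    1,
    {"MLFLOW_SCORING_SERVER_REQUEST_TIMEOUT": "300"},
)
print(json.dumps(payload, indent=2))
```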

As a test, I also served a simple linear regression model. That endpoint was created successfully and works fine.
