DBRX - Serving endpoint failed - update timed out.

Following this tutorial, I'm trying to serve an endpoint with the DBRX model connected to my data in a Vector DB.
I can log my model in Databricks with MLflow and call it locally from notebooks without any problem, but when I try to serve the endpoint it fails after about 35-40 minutes with the message:

OperationFailed: failed to reach NOT_UPDATING, got EndpointStateConfigUpdate.UPDATE_FAILED: current status: EndpointStateConfigUpdate.UPDATE_FAILED

In the create_and_wait() method I set the timeout parameter to two hours so that the call is not stopped after the default 20 minutes:

w.serving_endpoints.create_and_wait(name=serving_endpoint_name, config=endpoint_config, timeout=timedelta(hours=2))
The timeout value is applied correctly, so there must be another issue causing the timeout error.
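One way to see what sits behind UPDATE_FAILED is to catch the failure and ask the endpoint for its state and container build logs. The following is only a minimal sketch: create_and_report is a hypothetical helper name, `w` is assumed to be a databricks.sdk.WorkspaceClient, and served_model_name is assumed to be the name of the served model shown on the Serving tab.

```python
def create_and_report(w, name, config, served_model_name, timeout):
    """Create the endpoint; on failure, return its state and build logs.

    `w` is assumed to be a databricks.sdk.WorkspaceClient. It is passed in
    so that this helper needs no imports of its own.
    """
    try:
        w.serving_endpoints.create_and_wait(name=name, config=config, timeout=timeout)
        return None  # endpoint came up, nothing to report
    except Exception as exc:  # e.g. OperationFailed or TimeoutError from the SDK
        # The endpoint object still exists after a failed update, so its
        # state and the build logs of the served model can be inspected.
        details = w.serving_endpoints.get(name=name)
        build = w.serving_endpoints.build_logs(
            name=name, served_model_name=served_model_name
        )
        return {"error": str(exc), "state": details.state, "build_logs": build.logs}
```

If the returned build logs contain the conda errors, they should point at the dependency that fails to resolve while the serving container is built.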

Screenshots from the Serving tab in Databricks:

In the service logs I can also see some exceptions raised by conda:


Any idea how to solve the issue? 


Hi @quintrix

  • Verify that your model artifacts are correctly located. Ensure that the model files are accessible and in the expected location.
  • If your model has any initialization scripts or package dependencies, make sure they are set up correctly.
  • Keep an eye on endpoint health metrics such as QPS (queries per second), latency, and error rates. These can help diagnose any anomalies.
  • If you notice any unusual patterns, investigate further to identify the root cause.
  • As an immediate workaround, consider adding retries around the model download logic. You can add a couple of retries with a short sleep interval (e.g., 1 second) between retries.
  • Transient network issues might be causing the timeout, and retries can help mitigate this.
  • If you’re using Azure DevOps Pipelines, ensure that proper firewall rules are set up for your Azure database.
  • Incorrect firewall rules could prevent successful communication between your Databricks cluster and the database.
  • MLflow has a parameter called MLFLOW_SCORING_SERVER_REQUEST_TIMEOUT that controls the timeout for model scoring server requests.
  • You can set this environment variable in your deployment environment. 
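The retry suggestion above could be sketched roughly as follows. This is an illustrative helper, not part of MLflow or the Databricks SDK: call_with_retries and its parameters are hypothetical names, and it only helps when the failure really is transient.

```python
import time


def call_with_retries(fn, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call fn() up to `attempts` times, sleeping between failed tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
            # Sleep only if another attempt is still to come.
            if attempt < attempts:
                sleep(delay_seconds)
    # Every attempt failed: re-raise the last error so the caller sees it.
    raise last_error
```

A call such as call_with_retries(download_model), where download_model is whatever function fetches the artifacts, would retry it twice with a one-second pause. A failure that repeats identically on every attempt points to a deterministic cause rather than a network glitch.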

Good luck with resolving the issue! 

Thank you for the answer.

  • The model is located in Unity Catalog like so:
  • The model isn't deployed yet, so I can't check health metrics.
  • I don't use Azure DevOps.
  • I've implemented 5 retries (the first run creates the endpoint, the next ones try to update it), but all of them generate the same error. Each time it seems to fail after a similar period of time:
  • If I understand correctly, model serving does not run on my cluster, where I could set environment variables. Please correct me if I'm wrong. I can create the endpoint from the UI with every cluster turned off, and none of my clusters are running at this time:
    So where can I set the variable?
    I've set the variable in code before calling the create_and_wait() method, but I'm not sure whether that is correct.
  • Any other ideas?
  • What about the conda exceptions during deployment? How could I debug them?
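On the question of where the variable goes: as far as I know, a serving endpoint accepts environment variables per served entity in its configuration (the environment_vars field) rather than picking them up from the notebook or cluster that creates it. The sketch below builds such a request body as a plain dict. The helper name, endpoint name, model name, version and the value "300" are all placeholders, and nothing in this thread confirms that this variable affects the failing update.

```python
def build_endpoint_payload(endpoint_name, entity_name, entity_version, env_vars):
    """Build a create-endpoint request body with per-entity environment variables."""
    return {
        "name": endpoint_name,
        "config": {
            "served_entities": [
                {
                    "entity_name": entity_name,        # Unity Catalog model name
                    "entity_version": entity_version,
                    "workload_size": "Small",          # placeholder size
                    "scale_to_zero_enabled": True,
                    # Set inside the serving container, not on the caller.
                    "environment_vars": dict(env_vars),
                }
            ]
        },
    }


# Placeholder names and values, for illustration only.
payload = build_endpoint_payload(
    "my-endpoint",
    "catalog.schema.my_model",
    "1",
    {"MLFLOW_SCORING_SERVER_REQUEST_TIMEOUT": "300"},
)
```

Setting the variable in the notebook before create_and_wait() only changes the environment of the process making the call, which would explain why it appears to have no effect on the endpoint.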

As a test, I also served a simple linear regression model. That endpoint was created successfully and works fine.

