Databricks Community

quintrix · ‎06-02-2024

Hi,

https://notebooks.databricks.com/demos/llm-rag-chatbot/index.html

Following this tutorial, I'm trying to serve an endpoint with a DBRX model connected to my data in a vector DB.
I can log my model in Databricks with MLflow and call it locally from notebooks without any problem, but when I try to serve the endpoint it fails after about 35-40 minutes with the message:

OperationFailed: failed to reach NOT_UPDATING, got EndpointStateConfigUpdate.UPDATE_FAILED: current status: EndpointStateConfigUpdate.UPDATE_FAILED

In the create_and_wait() method I set the timeout parameter to two hours so that the call doesn't stop after the default 20 minutes, like so:

w.serving_endpoints.create_and_wait(name=serving_endpoint_name, config=endpoint_config, timeout=timedelta(hours=2))

The timeout value is applied correctly, so there must be another issue causing the error.
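
To illustrate why a longer timeout alone may not help, here is a simplified model of a client-side wait loop. It is not the actual Databricks SDK code: the function name `wait_until_not_updating`, the injected `get_state` callable, the local `OperationFailed` class, and the `IN_PROGRESS` placeholder state are assumptions for illustration. The point is that the timeout only limits how long the client keeps polling, while a failed state reported by the service ends the wait immediately.

```python
import time


class OperationFailed(Exception):
    """Local stand-in for the error raised when the update fails (assumption)."""


def wait_until_not_updating(get_state, timeout_seconds, poll_seconds=1.0,
                            sleep=time.sleep):
    """Poll get_state() until the endpoint settles, fails, or the deadline passes."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        state = get_state()
        if state == "NOT_UPDATING":
            return state  # update finished
        if state == "UPDATE_FAILED":
            # The service reported a failure; a longer timeout would not change this.
            raise OperationFailed(f"failed to reach NOT_UPDATING, got {state}")
        sleep(poll_seconds)
    raise TimeoutError(f"timed out after {timeout_seconds} seconds")


# Demo with a fake state source instead of a real endpoint.
fake_states = iter(["IN_PROGRESS", "IN_PROGRESS", "UPDATE_FAILED"])
try:
    wait_until_not_updating(lambda: next(fake_states), timeout_seconds=7200,
                            sleep=lambda seconds: None)
except OperationFailed as err:
    print(f"wait ended early: {err}")
```

In this model, the two-hour timeout is never reached because the failure state arrives first, which would be consistent with the endpoint failing after 35-40 minutes regardless of the timeout.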

Screenshots from Serving tab in Databricks:

In the service logs I can also see some exceptions raised by conda:

Any idea how to solve the issue?

quintrix · ‎06-04-2024

Thank you for the answer.

The model is located in Unity Catalog like so:
The model isn't deployed yet so can't check health metrics.
I don't use Azure DevOps.
I've implemented 5 retries (the first run creates the endpoint, the next ones try to update it), but all of them generate the same error. Each time it seems to fail after a similar period of time:
If I understand correctly, model serving does not take place on my cluster, where I could set environment variables. Please correct me if I'm wrong. I can create the endpoint from the UI with the cluster off, and none of my clusters are running at that time:

So where can I set the variable?
I've set the variable in code before calling create_and_wait(), but I'm not sure that's correct.
Any other ideas?
What about the conda exceptions during deployment? How could I debug them?
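
On the question of where the variable can be set: as far as I understand the serving endpoints REST API, environment variables for a served model are passed in the endpoint configuration itself (an `environment_vars` map on each served entity), not on a notebook cluster. The sketch below only builds such a request body as a plain dictionary so the shape is visible. The helper name `build_endpoint_payload`, the endpoint and model names, the variable name `MY_VARIABLE`, and the secret scope and key are all hypothetical placeholders, and the field names should be checked against the current documentation.

```python
import json


def build_endpoint_payload(endpoint_name, model_name, model_version, env_vars):
    """Build a serving endpoint request body with per-entity environment variables."""
    served_entity = {
        "entity_name": model_name,            # Unity Catalog model name (placeholder)
        "entity_version": str(model_version),
        "workload_size": "Small",
        "scale_to_zero_enabled": True,
        # Variables made available to the model serving container.
        "environment_vars": dict(env_vars),
    }
    return {"name": endpoint_name, "config": {"served_entities": [served_entity]}}


payload = build_endpoint_payload(
    endpoint_name="my_endpoint",                        # hypothetical
    model_name="my_catalog.my_schema.my_model",         # hypothetical
    model_version=1,
    # Hypothetical variable; the value uses the secret reference placeholder syntax.
    env_vars={"MY_VARIABLE": "{{secrets/my_scope/my_key}}"},
)
print(json.dumps(payload, indent=2))
```

Setting the variable with `os.environ` in the notebook before calling create_and_wait() would only affect the notebook process, so under this assumption it would not reach the serving container.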

As a test I also served a simple linear regression model. The endpoint was created successfully and works fine.
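
For reference, the retry pattern described above (first attempt creates the endpoint, later attempts update it) can be sketched roughly as follows. This is a minimal sketch and not the exact code from my notebook: `deploy_with_retries`, `create`, and `update` are hypothetical names, and the two callables stand in for the real create and update calls.

```python
import time


def deploy_with_retries(create, update, attempts=5, delay_seconds=0.0):
    """Call create() once, then update() on each retry, collecting errors."""
    errors = []
    for attempt in range(1, attempts + 1):
        action = create if attempt == 1 else update
        try:
            return action()
        except Exception as err:  # broad on purpose: this is only a sketch
            errors.append(str(err))
            print(f"attempt {attempt} failed: {err}")
            time.sleep(delay_seconds)
    # Distinct messages help show whether every attempt hit the same problem.
    summary = "; ".join(sorted(set(errors)))
    raise RuntimeError(f"all {attempts} attempts failed: {summary}")
```

If every attempt fails with the same message after a similar amount of time, as in my case, retrying is probably not the fix and the cause is more likely in the build or environment setup of the model itself.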