Generative AI
Explore discussions on generative artificial intelligence techniques and applications within the Databricks Community. Share ideas, challenges, and breakthroughs in this cutting-edge field.

Model serving endpoint API error - Mosaic AI Agents API

JingXie
New Contributor II

Hi, 

I have built a Chatbot App using Streamlit (through the Databricks Apps UI). The chatbot backend is a custom RAG model built by following this example notebook: 02-Deploy-RAG-Chatbot-Model.

The needed databricks packages are installed: 

databricks-vectorsearch==0.56
mlflow==2.20.2
databricks-langchain==0.5.0
databricks-agents==0.20.0
databricks-sdk==0.50.0
 
The RAG chain includes: databricks-gte-large-en model as the embedding model and databricks-meta-llama-3-1-70b-instruct model as the llm generator.
I registered the RAG chain to UC and deployed it using Model Serving. Once the RAG model endpoint is deployed and ready, the Chatbot UI can query it. Since the Chatbot App was created through Databricks Apps, it by default uses a Service Principal to access the RAG model serving endpoint. I also implemented a Token Manager class to refresh the access token before it expires.
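For context, the Token Manager is essentially a wrapper around the OAuth client-credentials flow for the App's Service Principal. A simplified sketch (with illustrative names, not my exact code):

import time
import requests

WORKSPACE_URL = "https://dbc-xxxxx.cloud.databricks.com"
CLIENT_ID = "..."      # the App's Service Principal application ID
CLIENT_SECRET = "..."  # the Service Principal OAuth secret

class TokenManager:
    """Caches a workspace OAuth token and refreshes it shortly before it expires."""

    def __init__(self, margin_seconds=300):
        self._token = None
        self._expires_at = 0.0
        self._margin = margin_seconds

    def get_token(self):
        if self._token is None or time.time() >= self._expires_at - self._margin:
            self._refresh()
        return self._token

    def _refresh(self):
        # Standard OAuth client-credentials (M2M) flow against the workspace.
        resp = requests.post(
            f"{WORKSPACE_URL}/oidc/v1/token",
            auth=(CLIENT_ID, CLIENT_SECRET),
            data={"grant_type": "client_credentials", "scope": "all-apis"},
            timeout=30,
        )
        resp.raise_for_status()
        payload = resp.json()
        self._token = payload["access_token"]
        self._expires_at = time.time() + payload.get("expires_in", 3600)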
 
However, every time the RAG model endpoint comes up (after a restart or after scaling from zero), it only works for one hour. After one hour, my chatbot UI can no longer query the RAG model endpoint and receives this error message:


HTTP error occurred: 400 Client Error: Bad Request for url: https://dbc-xxxxx.cloud.databricks.com/serving-endpoints/chatbot_poc/invocations

🔢Status code: N/A 📄 Response text: No response text 📦 Parsed JSON (if available): {'error_code': 'BAD_REQUEST', 'message': '1 tasks failed. Errors: {0: 'error: Exception("Response content b\'Invalid Token\', status_code 400") Traceback (most recent call last):\n File "/opt/conda/envs/mlflow-env/lib/python3.12/site-packages/databricks/vector_search/utils.py", line 128, in issue_request\n response.raise_for_status()\n File "/opt/conda/envs/mlflow-env/lib/python3.12/site-packages/requests/models.py", line 1024, in raise_for_status\n raise HTTPError(http_error_msg, response=self)\nrequests.exceptions.HTTPError: 400 Client Error: Bad Request for url: https://dbcxxx.cloud.databricks.com/api/2.0/serving-endpoints/databricks-gte-large-en\\n\\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File "/opt/conda/envs/mlflow-env/lib/python3.12/site-packages/mlflow/langchain/api_request_parallel_processor.py",

 

If I open the Bad Request URL directly, I get this error message:
{"error_code":401,"message":"Credential was not sent or was of an unsupported type for this API. [ReqId: xxxxx-xxxx]"}
 
I have tried everything I could find but still cannot diagnose the root cause. It seems related to some MLflow environment configuration for the databricks-gte-large-en model serving endpoint, but that endpoint is not managed by me and I could not find the source code behind this API. I also tried using a PAT to query the RAG endpoint and still hit the same error.
 
Could you help me diagnose this error? I need to deploy this Chatbot App to our internal team, so a RAG model serving endpoint that fails after one hour and has to be restarted manually is definitely not something we can use.
 
Thanks!
 
Regards,
Jing Xie 
2 REPLIES

stbjelcevic
Databricks Employee

Hi @JingXie,

This is a classic authentication token issue, but it's happening in a different place than you might think. The Invalid Token error is not coming from your Streamlit App's call to your chatbot_poc endpoint.

It's coming from inside your chatbot_poc endpoint when it tries to call the databricks-gte-large-en embedding model endpoint.

Here is the chain of events and where the failure occurs:

  1. Streamlit App to RAG Endpoint: Your Streamlit App (running as a Databricks App) uses its Service Principal (SP) and your Token Manager to get a token. It successfully sends a request to .../chatbot_poc/invocations. This part works.

  2. RAG Endpoint to Embedding Endpoint:

    • Your chatbot_poc model (the pyfunc RAG chain) receives the request.

    • To process the query, it must first embed the user's text. To do this, it makes its own API call to the databricks-gte-large-en foundation model endpoint.

    • This is where it fails. The error 400 Client Error: Bad Request for url: https://dbcxxx.cloud.databricks.com/api/2.0/serving-endpoints/databricks-gte-large-en and the content b'Invalid Token' prove this.

The root cause is that the identity (token) your chatbot_poc endpoint is using to call the embedding endpoint expires after one hour. This identity was likely "baked in" when you logged the model to MLflow. You probably initialized the DatabricksEmbeddings client in your notebook, which captured your notebook's temporary 1-hour token. When the model serving endpoint starts, it uses this stale, captured token, which works for an hour and then dies.
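For illustration only, this is the kind of pattern that bakes a short-lived notebook token into the logged chain. It is a hedged sketch, not your actual code, and it only runs inside a Databricks notebook where dbutils exists:

# ANTI-PATTERN (illustrative): capturing the notebook's ~1-hour token at logging time.
from databricks.vector_search.client import VectorSearchClient

notebook_token = (
    dbutils.notebook.entry_point.getDbutils()
    .notebook()
    .getContext()
    .apiToken()
    .get()
)

# Credentials passed explicitly here are captured with the logged chain, and the
# serving endpoint keeps using them long after they have expired.
vsc = VectorSearchClient(
    workspace_url="https://dbc-xxxxx.cloud.databricks.com",
    personal_access_token=notebook_token,
)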

Cell 8 of the 02-Deploy-RAG-Chatbot-Model notebook warns about relying on notebook auth (see attached screenshot).

Your Streamlit app's Token Manager is irrelevant here; it only manages the token for step 1. You need to fix the token for step 2.

Recommended Solution: Use a Service Principal for Model Serving

The most robust and standard solution is to configure your chatbot_poc Model Serving endpoint to "run as" a Service Principal.

This gives the endpoint its own identity. When it needs to call other Databricks services (like the embedding endpoint or the Llama 3 endpoint), it will use this SP identity to automatically generate a fresh, short-lived token for every single call. This completely eliminates the 1-hour expiry problem.

Step-by-Step Fix

Here’s how to implement this fix:

1. Ensure You Have a Service Principal (SP)

You probably already have one for your Databricks App. You can use the same one. If not, create a new one and record its Application ID.
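If you need to create one programmatically, here is a minimal sketch with the databricks-sdk you already have installed (the display name below is just an example):

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # uses your ambient authentication (profile, env vars, or notebook)

# Create a workspace service principal and record its Application ID.
sp = w.service_principals.create(display_name="chatbot-poc-sp")  # example name
print(sp.application_id)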

2. Grant the SP "Can Query" Permissions

This is the most critical step: your Service Principal needs permission to call the other model endpoints. (A scripted version of the same grants is sketched after the list below.)

  1. Go to Machine Learning > Serving.

  2. Find the databricks-gte-large-en endpoint.

  3. Click it, then go to the Permissions tab.

  4. Add your Service Principal and give it the Can Query permission.

  5. Repeat this process for your LLM endpoint: databricks-meta-llama-3-1-70b-instruct.

  6. Repeat this process for your Vector Search endpoint (if it's a separate serving endpoint).
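If you prefer to script these grants instead of clicking through the UI, here is a hedged sketch with the databricks-sdk (the SP application ID is a placeholder; Vector Search endpoint permissions are managed separately):

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    ServingEndpointAccessControlRequest,
    ServingEndpointPermissionLevel,
)

w = WorkspaceClient()
sp_application_id = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"  # your SP's Application ID

for endpoint_name in [
    "databricks-gte-large-en",
    "databricks-meta-llama-3-1-70b-instruct",
]:
    endpoint = w.serving_endpoints.get(endpoint_name)
    # Add a CAN_QUERY grant for the service principal on each endpoint.
    w.serving_endpoints.update_permissions(
        serving_endpoint_id=endpoint.id,
        access_control_list=[
            ServingEndpointAccessControlRequest(
                service_principal_name=sp_application_id,
                permission_level=ServingEndpointPermissionLevel.CAN_QUERY,
            )
        ],
    )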

3. Re-log Your RAG Chain (A Quick Check)

This is a sanity check to ensure you aren't hardcoding any tokens. Your code for initializing the LangChain components inside your model-logging notebook should look like this:

 
import mlflow
from langchain_community.chat_models import ChatDatabricks
from langchain_community.embeddings import DatabricksEmbeddings
# (With the databricks_langchain package you already have installed, the equivalent
#  imports are: from databricks_langchain import ChatDatabricks, DatabricksEmbeddings)

# CORRECT: Just specify the endpoint name.
# DO NOT pass a 'token', 'api_token', or 'host'.
# The library will automatically pick up credentials when running on Databricks.

embed_model = DatabricksEmbeddings(endpoint="databricks-gte-large-en")

llm = ChatDatabricks(
    endpoint="databricks-meta-llama-3-1-70b-instruct",
    max_tokens=250,
)

# ... your vector store and chain definitions ...

# Log the model
mlflow.langchain.log_model(chain, ...)

If your code was passing a token (e.g., from dbutils), remove it and re-log your model to UC.
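For a concrete reference, a minimal re-logging sketch looks like this; the catalog, schema, and model names are placeholders, and the exact log_model arguments depend on how you built your chain:

import mlflow

# Register to Unity Catalog rather than the workspace model registry.
mlflow.set_registry_uri("databricks-uc")

with mlflow.start_run():
    mlflow.langchain.log_model(
        chain,
        artifact_path="chain",
        registered_model_name="main.rag_demo.chatbot_poc_chain",  # placeholder UC name
    )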

4. Re-configure Your chatbot_poc Endpoint

This is the final step to tie it all together.

  1. Go to Machine Learning > Serving.

  2. Find and click on your chatbot_poc endpoint.

  3. Click the Edit button (or "Edit configuration").

  4. Look for the "Run as" setting. By default, it's probably set to your user account ([user]@domain.com).

  5. Change this setting from your user to the Service Principal you configured in Step 2.

  6. Save the endpoint configuration.

The endpoint will restart. Once it's ready, it will now be running as the Service Principal. When your Streamlit app calls it, your RAG chain's internal calls to the embedding and LLM endpoints will use the SP's identity to generate new tokens on the fly, and the 1-hour expiry problem will be gone.
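Once the endpoint is back up, you can sanity-check the fix from any client that is not relying on your notebook's token, for example with the SDK (the inputs payload below is a placeholder; use whatever input schema your chain was logged with):

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # authenticate as the App's SP or with a PAT

response = w.serving_endpoints.query(
    name="chatbot_poc",
    inputs=[{"query": "test question"}],  # adjust to your chain's input schema
)
print(response.predictions)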

Please let me know if the above does not solve your issue, and share any additional information you are seeing to help me diagnose what else might be going wrong.

Thank you!

 

JingXie
New Contributor II

Hi, 

Thank you very much for providing such a comprehensive explanation and step-by-step solution!

I will follow your instructions to fix it. If I run into any errors, I will keep you posted.

Best regards, 

Jing Xie