Insufficient Permission Error When Serving RAG Model with Multiple Vector Search Indexes

Karthik_Karanm — Mon, 12 May 2025 16:00:57 GMT

Hi Community,

I’m currently working on a Retrieval-Augmented Generation (RAG) use case in Databricks. I’ve successfully implemented and served a model that uses a single Vector Search index, and everything works as expected.

However, when I try to serve a model that utilizes multiple Vector Search indexes, I encounter the following error during model serving:

mlflow.exceptions.MlflowException: Failed to run user code from /model/model.py. Error: Response content b'{"error_code":"PERMISSION DENIED","message":"Insufficient permissions for UC entity cd.schema.table_vs","details":[{"@type":"type.googleapis.com/google.rpc.RequestInfo","request_id":"b5c11ebd-f66d-4574-9bdc-b89bf6d06339","serving_data":""}]}', status_code 403. Review the stack trace for more information.

"Insufficient permission to the vector search tables"

All the involved vector search indexes are accessible during the indexing and model creation phase. The issue only appears when attempting to serve the model.

Key Observations:

- Serving a model with a single vector search index works fine.
- Serving a model with multiple vector search indexes leads to a permission error.
- The permissions on the individual vector search tables seem to be set correctly, and the tables are accessible in other contexts.

Has anyone faced a similar issue or can suggest what specific permissions might be missing when using multiple indexes in a RAG setup?

Thanks in advance!

#Databricks #VectorSearch #RAG #MLflow #ModelServing #DatabricksPermissions #LakehouseAI #GenAI #DatabricksCommunity #MLOps

Re: Insufficient Permission Error When Serving RAG Model with Multiple Vector Search Indexes

lingareddy_Alva — Tue, 13 May 2025 01:12:09 GMT

@Karthik_Karanm

This is a known issue pattern when using multiple Unity Catalog (UC) vector search indexes in
Databricks Model Serving — especially under MLflow model serving endpoints with RAG architecture.

Your model serving environment (i.e., the model inference cluster running the MLflow model)
does not inherit the permissions of your interactive environment (such as a notebook). As a result:
- Unity Catalog returns 403 PERMISSION_DENIED errors.
- Vector search tables that you can query during development may be unreachable at serving time, because the endpoint runs in a separate,
tightly scoped environment that likely lacks direct access to the underlying Unity Catalog assets (like schema.table_vs).

To resolve this, you'll need to explicitly grant access to the Unity Catalog entities (vector search tables) for the model serving principal.

Re: Insufficient Permission Error When Serving RAG Model with Multiple Vector Search Indexes

Karthik_Karanm — Tue, 13 May 2025 16:02:06 GMT

Hi @lingareddy_Alva,

Thank you for your detailed response — it definitely helped clarify the separation between the interactive environment and the model serving environment in Databricks.

However, I’m still encountering the same issue even though I am the owner of all the involved entities:

- The Unity Catalog tables that back the vector search indexes
- The Vector Search indexes themselves
- The MLflow model and serving endpoints

Re: Insufficient Permission Error When Serving RAG Model with Multiple Vector Search Indexes

lingareddy_Alva — Tue, 13 May 2025 16:25:45 GMT

Hi @Karthik_Karanm

The Model Serving environment runs in an isolated, production-grade context (different compute plane than your interactive workspace).
Even though you own the objects, the serving runtime executes as a system service principal or service identity that:
-- May not inherit your personal workspace permissions
-- Needs explicit permissions granted to access Unity Catalog tables and Vector Search indexes

1. Grant Permissions to the Model Serving Identity
You need to manually grant SELECT privileges to the serving identity on:
-- The Vector Search index-backed tables
-- Optionally, the schemas and catalogs themselves if using fine-grained access control

First, identify the serving identity (it might be something like databricks-model-serving)
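
A minimal sketch of the grants from step 1, written as a Python helper that only builds the SQL strings. The helper name `build_grant_statements` and the principal `my-serving-sp` are hypothetical; the real serving identity depends on how the endpoint was deployed, and the second index name is a placeholder modeled on the one in the error message.

```python
def build_grant_statements(securables: list[str], principal: str) -> list[str]:
    """Build Unity Catalog GRANT statements for three-level names.

    Each name must look like catalog.schema.object; anything else raises
    ValueError from the tuple unpacking below.
    """
    statements: list[str] = []

    def add(statement: str) -> None:
        # Skip duplicates so a shared catalog/schema is granted only once.
        if statement not in statements:
            statements.append(statement)

    for name in securables:
        catalog, schema, _ = name.split(".")
        add(f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`")
        add(f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`")
        add(f"GRANT SELECT ON TABLE {name} TO `{principal}`")
    return statements


# Hypothetical principal and index names, for illustration only.
for statement in build_grant_statements(
    ["cd.schema.table_vs", "cd.schema.other_vs"], "my-serving-sp"
):
    print(statement)
```

The printed statements could then be run from a notebook or the SQL editor by an owner of the objects. Whether these grants are what the endpoint is actually missing is an assumption to verify.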

2. Enable Table ACLs in Unity Catalog (if not already)
Ensure that Table Access Control (Table ACLs) is enabled in the workspace and catalog. You can check this under:
Admin Console → Data → Unity Catalog → Permissions → Table Access Control =On

3. Re-deploy or Rebuild Model After Permissions Update
Sometimes permissions don't take immediate effect for a running model. You may need to:
-- Rebuild and log the MLflow model (if UC tags changed)
-- Delete and redeploy the endpoint
-- Or at minimum, restart the endpoint to clear cached permissions

4. Use catalog.schema.table Syntax Explicitly in Model Code
Sometimes the serving context is sensitive to fully qualified names.
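
As an illustration of step 4, here is a small check that every index name used in model code is a three-level `catalog.schema.name`. The helper names and the simple identifier pattern are assumptions made for this sketch (names that need backtick quoting are not handled); `cd.schema.table_vs` is the name from the error message.

```python
import re

# Assumption: plain identifiers only (letters, digits, underscores).
_IDENT = r"[A-Za-z_][A-Za-z0-9_]*"
_THREE_LEVEL = re.compile(rf"{_IDENT}\.{_IDENT}\.{_IDENT}")


def is_fully_qualified(name: str) -> bool:
    """Return True if name has the form catalog.schema.object."""
    return _THREE_LEVEL.fullmatch(name) is not None


def qualify(name: str, default_catalog: str, default_schema: str) -> str:
    """Expand a one- or two-level name into a three-level name."""
    parts = name.split(".")
    if len(parts) == 3:
        return name
    if len(parts) == 2:
        return f"{default_catalog}.{name}"
    if len(parts) == 1:
        return f"{default_catalog}.{default_schema}.{name}"
    raise ValueError(f"Too many name parts: {name!r}")


print(is_fully_qualified("cd.schema.table_vs"))
print(qualify("table_vs", "cd", "schema"))
```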

Extra Debugging Tip
To simulate the serving context, create a service principal and attach it to a job cluster or notebook using impersonation mode.
If that principal fails with the same error, you've validated the access mismatch.

Re: Insufficient Permission Error When Serving RAG Model with Multiple Vector Search Indexes

Karthik_Karanm — Thu, 15 May 2025 16:16:34 GMT

Hello @lingareddy_Alva,
Thank you for your time.

Please give me some clarification on this:

The permission error occurred when we used multiple vector search indexes for a single model; we hit it during model registration.
However, when we used a single vector search index for the same model, registration completed successfully and everything worked as expected.

Could you please help us understand why this issue occurs only in the first scenario, with multiple indexes?

Re: Insufficient Permission Error When Serving RAG Model with Multiple Vector Search Indexes

lingareddy_Alva — Thu, 15 May 2025 16:56:29 GMT

Hi @Karthik_Karanm

When registering a model that references multiple Unity Catalog tables (backing the vector indexes), Databricks attempts to access and resolve all table metadata during the packaging and validation steps of registration.

Here’s what changes with multiple indexes:

1. Expanded Scope of Access
-- Each Vector Search index is backed by a Delta table in Unity Catalog.
-- Using multiple indexes causes the model registration process to attempt to read metadata for all referenced UC entities.
-- If any of those tables have missing permissions, even temporarily, the registration will fail.

2. Stricter Enforcement in Model Context
-- During interactive development or indexing, you're likely operating under a full-access identity (e.g., your personal workspace or notebook).
-- During model registration, Databricks may execute in a different context (e.g., under the job's service principal or a model registry service identity), which may not have equivalent permissions.
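
One way to find out which of several indexes triggers the failure is to probe each one separately under the identity in question. The sketch below is hypothetical: `find_inaccessible` and `fake_probe` are illustrative names, and the probe is a stand-in for whatever call actually touches the index (for example, a lookup through the Vector Search client).

```python
from typing import Callable, Iterable


def find_inaccessible(
    index_names: Iterable[str], probe: Callable[[str], object]
) -> dict[str, str]:
    """Call probe on every index and collect the ones that raise."""
    failures: dict[str, str] = {}
    for name in index_names:
        try:
            probe(name)
        except Exception as exc:  # broad on purpose: report every failure
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


def fake_probe(name: str) -> str:
    """Stand-in probe that denies access to names ending in '_b'."""
    if name.endswith("_b"):
        raise PermissionError(f"Insufficient permissions for UC entity {name}")
    return name


print(find_inaccessible(["cat.sch.idx_a", "cat.sch.idx_b"], fake_probe))
```

Checking every index rather than stopping at the first error matters here, since a single missing grant among several indexes is enough to fail the whole model.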

Re: Insufficient Permission Error When Serving RAG Model with Multiple Vector Search Indexes

Karthik_Karanm — Fri, 16 May 2025 15:28:53 GMT

Hi @lingareddy_Alva

Databricks recommends using Unity Catalog instead of the legacy Table Access Control (TAC) feature. Enabling Unity Catalog requires configuring extra permissions, such as Cluster Access Control (ACLs).

We want to confirm whether this is the recommended approach moving forward, or if there is an alternative method to achieve the same access control functionality.

Thank you for your time.

Re: Insufficient Permission Error When Serving RAG Model with Multiple Vector Search Indexes

Karthik_Karanm — Fri, 16 May 2025 17:43:03 GMT

Hi @lingareddy_Alva
I forgot to mention that I am the metastore admin and workspace admin, and my served model runs under my user identity.

Thank you.

Re: Insufficient Permission Error When Serving RAG Model with Multiple Vector Search Indexes

Ramana — Mon, 19 May 2025 15:47:43 GMT

The error was misleading.
It is related to the library we used for agent authoring.
The issue was resolved when we changed the library from langchain_core.runnables to langgraph.graph with some additional code changes.

Here are the reference links:

https://docs.databricks.com/aws/en/generative-ai/agent-framework/log-agent#-specify-resources-for-automatic-authentication-passthrough-system-authentication

https://docs.databricks.com/aws/en/generative-ai/agent-framework/author-agent#chatagent
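
The first link covers declaring the agent's dependencies when the model is logged, so that the serving endpoint can authenticate to them. A hedged sketch of what that might look like for two indexes follows. The endpoint name `my-llm-endpoint` and the second index `cd.schema.other_vs` are made up for illustration, `resource_spec` and `to_mlflow_resources` are hypothetical helpers, and the MLflow import is deferred so that the first helper runs without MLflow installed.

```python
def resource_spec(
    llm_endpoints: list[str],
    vector_indexes: list[str],
    uc_functions: tuple[str, ...] = (),
) -> list[tuple[str, str]]:
    """Return de-duplicated (kind, name) pairs, one per dependency."""
    spec: list[tuple[str, str]] = []
    groups = (
        ("serving_endpoint", llm_endpoints),
        ("vector_search_index", vector_indexes),
        ("function", uc_functions),
    )
    for kind, names in groups:
        for name in names:
            if (kind, name) not in spec:
                spec.append((kind, name))
    return spec


def to_mlflow_resources(spec: list[tuple[str, str]]) -> list:
    """Convert (kind, name) pairs into MLflow resource objects."""
    # Imported lazily: requires an MLflow version that ships these classes.
    from mlflow.models.resources import (
        DatabricksFunction,
        DatabricksServingEndpoint,
        DatabricksVectorSearchIndex,
    )

    builders = {
        "serving_endpoint": lambda n: DatabricksServingEndpoint(endpoint_name=n),
        "vector_search_index": lambda n: DatabricksVectorSearchIndex(index_name=n),
        "function": lambda n: DatabricksFunction(function_name=n),
    }
    return [builders[kind](name) for kind, name in spec]


# Every index the agent queries is listed, not just the first one.
spec = resource_spec(
    ["my-llm-endpoint"],
    ["cd.schema.table_vs", "cd.schema.other_vs"],
)
print(spec)
```

The list returned by `to_mlflow_resources(spec)` would be passed as the `resources` argument of `mlflow.pyfunc.log_model` when logging the agent. Whether a missing entry was the cause in this thread is not confirmed by the posts above.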

Kudos to Jackson Turek from Databricks.

Thanks

Ramana

Re: Insufficient Permission Error When Serving RAG Model with Multiple Vector Search Indexes

lingareddy_Alva — Mon, 19 May 2025 16:00:18 GMT

Thank you

Re: Insufficient Permission Error When Serving RAG Model with Multiple Vector Search Indexes

bdroesch — Wed, 21 May 2025 16:10:40 GMT

@Ramana - can you please be more specific about the changes that were required? I am receiving the same error. I was originally using the langchain_core.runnables library and reworked the code to not rely on it, but I still get the same error when deploying. The agent works fine when I run it in my notebook. My original code (listed below) stemmed from the Multi-Agent Genie system example at the link below. I originally had additional nodes, including Genie, but removed them to try to get the deployment to work for now.

https://docs.databricks.com/aws/en/generative-ai/agent-framework/multi-agent-genie?scid=701Vp000004h4c4IAA&utm_medium=programmatic&utm_source=google&utm_campaign=22507112156&utm_adgroup=&utm_content=summit&utm_offer=dataaisummit&utm_ad=&utm_term=&gad_source=1&gad_campaignid=22507113074&gbraid=0AAAAABYBeAjJBK6Yps_hSSp9sIzsxssUG&gclid=EAIaIQobChMI9-Lhwfi0jQMVXQCtBh3fuDyzEAAYASAAEgLm_PD_BwE

import functools
import os
from typing import Any, Generator, Literal, Optional

import mlflow
from databricks.sdk import WorkspaceClient
from databricks_langchain import ChatDatabricks, VectorSearchRetrieverTool

from databricks_langchain.uc_ai import (
    DatabricksFunctionClient,
    UCFunctionToolkit,
    set_uc_function_client
)

from databricks_langchain.genie import GenieAgent
from langchain_core.runnables import RunnableLambda
from langgraph.graph import END, StateGraph
from langgraph.graph.state import CompiledStateGraph
from langgraph.prebuilt import create_react_agent
from mlflow.langchain.chat_agent_langgraph import ChatAgentState
from mlflow.pyfunc import ChatAgent
from mlflow.types.agent import (
    ChatAgentChunk,
    ChatAgentMessage,
    ChatAgentResponse,
    ChatContext,
)
from pydantic import BaseModel

from langchain_openai import OpenAIEmbeddings

mlflow.langchain.autolog()

############################################
############################################

LLM_ENDPOINT_NAME = "XXXX"

llm = ChatDatabricks(endpoint=LLM_ENDPOINT_NAME)

assert LLM_ENDPOINT_NAME is not None

client = DatabricksFunctionClient()
set_uc_function_client(client)

tools = []

# # TODO if desired, add additional tools and update the description of this agent
uc_tool_names = ["system.ai.python_exec"]#,"dev_catalog.tmp.liquidity_calculations"]
uc_toolkit = UCFunctionToolkit(function_names=uc_tool_names)
tools.extend(uc_toolkit.tools)

tools_agent_description = (
    "This agent can execute python code and perform data analysis.",
)

embedding_model = OpenAIEmbeddings(model="text-embedding-3-large")

index_name = "dev_catalog.default.earnings_index"
endpoint_name = "earnings_index"

vs_tools = []
vs_agent_description = ("""
The Earnings Vector Search agent has access to a knowledge base of earnings report data related to the company ABC
The knowledge base includes information from 10Ks and Investor Presentations.
Users will want to be able to access numerical data from these reports. This includes servicing related metrics such as delinquency
""")

# Adjacent string literals are concatenated, so each part ends with ". "
vs_tool_description = (
    "Provide users information from company earnings reports. "
    "Returns numerical results related to earnings, company performance, delinquency rates, origination volume. "
    "Avoid any explanation or commentary unless you are unsure."
)

vs_tool = [VectorSearchRetrieverTool(
    index_name=index_name,  # Index name in the format 'catalog.schema.index'
    num_results=4,  # Max number of documents to return
    query_type="ANN",  # Query type ("ANN" or "HYBRID").
    tool_name="earnings_reports_vector_search",  # Used by the LLM to understand the purpose of the tool
    tool_description=vs_tool_description,  # Used by the LLM to understand the purpose of the tool
    text_column="text",  # Specify text column for embeddings. Required for direct-access index or delta-sync index with self-managed embeddings.
    embedding=embedding_model  # The embedding model. Required for direct-access index or delta-sync index with self-managed embeddings.
)]

vs_tools.extend(vs_tool)

#tools.extend(earnings_vs_tool)

tools_agent = create_react_agent(llm, tools=tools)
vs_agent = create_react_agent(llm, tools=vs_tools)

worker_descriptions = {
    "Earnings Vector Search": vs_agent_description,
}

formatted_descriptions = "\n".join(
    f"- {name}: {desc}" for name, desc in worker_descriptions.items()
)

system_prompt = f"""You are the supervisor in a multi-agent system. Your job is to route the user's question to the appropriate specialist agent(s).

You may choose from the following workers, or select FINISH if the question has already been fully answered based on data that has been retrieved from Genie queries.

- If multiple agents are needed, route them one at a time and collect their answers before selecting FINISH.
- Maintain awareness of which agents have already responded by reviewing the message history.
- IMPORTANT: DO NOT CALL AN AGENT MORE THAN ONCE

Available agents:
{formatted_descriptions}
"""

options = ["FINISH"] + list(worker_descriptions.keys())

def supervisor_agent(state):
    class nextNode(BaseModel):
        next_node: Literal[tuple(options)]

    preprocessor = RunnableLambda(
        lambda state: [{"role": "system", "content": system_prompt}] + state["messages"]
    )
    supervisor_chain = preprocessor | llm.with_structured_output(nextNode)
    return supervisor_chain.invoke(state)

def agent_node(state, agent, name):
    result = agent.invoke(state)
    return {
        "messages": [
            {
                "role": "assistant",
                "content": result["messages"][-1].content,
                "name": name,
            }
        ]
    }

def final_answer(state):
    system_prompt = f'''
    Using only the content in the messages, respond to the user's question using the answer given by the other agents.
    - You should be trying to create a report framework that has an "Introduction"
    - It should then have a section of a high level summary, title this as "Overview"
    - Then have a final section that says "Figures" and includes just sub-bullets with any numerical value for each requested metric
    '''

    preprocessor = RunnableLambda(
        lambda state: [{"role": "system", "content": system_prompt}] + state["messages"]
    )
    final_answer_chain = preprocessor | llm
    return {"messages": [final_answer_chain.invoke(state)]}

class AgentState(ChatAgentState):
    next_node: str

vs_node = functools.partial(agent_node, agent=vs_agent, name="Earnings Vector Search")

workflow = StateGraph(AgentState)
workflow.add_node("Earnings Vector Search", vs_node)
workflow.add_node("supervisor", supervisor_agent)
workflow.add_node("final_answer", final_answer)

workflow.set_entry_point("supervisor")
# We want our workers to ALWAYS "report back" to the supervisor when done
for worker in worker_descriptions.keys():
    workflow.add_edge(worker, "supervisor")

# Let the supervisor decide which node to go to next
workflow.add_conditional_edges(
    "supervisor",
    lambda x: x["next_node"],
    {**{k: k for k in worker_descriptions.keys()}, "FINISH": "final_answer"},
)
workflow.add_edge("final_answer", END)
multi_agent = workflow.compile()

class LangGraphChatAgent(ChatAgent):
    def __init__(self, agent: CompiledStateGraph):
        self.agent = agent

    def predict(
        self,
        messages: list[ChatAgentMessage],
        context: Optional[ChatContext] = None,
        custom_inputs: Optional[dict[str, Any]] = None,
    ) -> ChatAgentResponse:
        request = {
            "messages": [m.model_dump_compat(exclude_none=True) for m in messages]
        }

        messages = []
        for event in self.agent.stream(request, stream_mode="updates"):
            for node_data in event.values():
                messages.extend(
                    ChatAgentMessage(**msg) for msg in node_data.get("messages", [])
                )
        return ChatAgentResponse(messages=messages)

    def predict_stream(
        self,
        messages: list[ChatAgentMessage],
        context: Optional[ChatContext] = None,
        custom_inputs: Optional[dict[str, Any]] = None,
    ) -> Generator[ChatAgentChunk, None, None]:
        request = {
            "messages": [m.model_dump_compat(exclude_none=True) for m in messages]
        }
        for event in self.agent.stream(request, stream_mode="updates"):
            for node_data in event.values():
                yield from (
                    ChatAgentChunk(**{"delta": msg})
                    for msg in node_data.get("messages", [])
                )

# Create the agent object, and specify it as the agent object to use when
# loading the agent back for inference via mlflow.models.set_model()
AGENT = LangGraphChatAgent(multi_agent)
mlflow.models.set_model(AGENT)