โ11-07-2025 04:00 AM
Hi Databricks Team / Community,
Iโm encountering a 500 Internal Server Error when calling an Agent Bricks MAS endpoint in my workspace. The error message is:
500 Internal Error. Please try again later. If this issue persists, please contact Databricks support.
Context:
Troubleshooting Iโve Tried:
I would appreciate guidance on:
Thank you in advance for any help or insights!
โ11-08-2025 02:00 PM
Hi @tsukitsune , thanks for the detailed contextโhereโs a concise set of causes, diagnostics, and workarounds to get your multi-agent supervisor stable.
Missing or misconfigured Agent Framework OnโBehalfโOf (OBO) Authorization. MAS invokes subโagents with the callerโs permissions; OBO must be enabled and the MAS reโcreated after toggling it.
Subโagent uses a disabled payโasโyouโgo (PayGo) model (e.g., Claude) or a model thatโs not allowed in the workspace; MAS logs show PERMISSION_DENIED/Model disabled and bubble up as 500.
Intermittent infra issues or a prior MAS bug around parallel tool calls; a fix was shippedโupdating the endpoint resolved repeated 500s in multiple workspaces.
Rate limiting can surface as 500 in some paths; ensure AI Gateway rate limits arenโt being hit by MAS traffic.
Serverless compute dependency missing in the workspace (MAS relies on serverless model serving).
Payload/response size or execution limits exceeded during orchestration (e.g., Genie returning large intermediate results). For agents, request payload limit is 4 MB, and responses >1 MB arenโt logged; max execution time per request is 297s.
Using unsupported subโagent types. MAS currently supports Agent Bricks: Knowledge Assistant endpoints (plus Genie, UC functions, and MCP servers). Custom code agents not created via Knowledge Assistant are not supported as โAgent Endpointโ in the MAS UI.
Pull model server logs for the served MAS entity via REST; these show runtime errors that lead to 500s:
# Served model logs
curl -H "Authorization: Bearer $TOKEN" \
"https://<workspace-host>/api/2.0/serving-endpoints/<mas-endpoint>/served-models/<served-model-name>/logs?config_version=1"
And container build logs:
curl -H "Authorization: Bearer $TOKEN" \
"https://<workspace-host>/api/2.0/serving-endpoints/<mas-endpoint>/served-models/<served-model-name>/build-logs?config_version=1"
Enable AI Gateway inference tables on the MAS endpoint; these log request/response payloads and MLflow traces for agents. Note: logging is bestโeffort and may not populate for 500s; payloads >1 MiB wonโt be logged.
Use MLflow 3 realโtime tracing for agent observability; MAS and subโagents log traces to an experiment and optionally to Delta tables for production monitoring.
Check endpoint health metrics (latency, error rate, QPS) and service logs in the Serving UI for runtime behavior and failures.
MAS supports up to 10 agents/tools; ensure each end user has explicit access to every subโagent (CAN QUERY for KA, Share for Genie, EXECUTE for UC functions, USE CONNECTION for MCP).
Knowledge Assistant embedding endpoint (databricksโgteโlargeโen) must have AI Guardrails and rate limits disabled for ingestion; confirm this in Gateway settings.
MAS was not designed to pass large dataframes between Genie spaces; it routes and consolidates answers. If your Genie agent produces large intermediate data (e.g., 5000ร22 rows), downโsample/summarize inโagent, or narrow the query so MAS handles smaller responses.
If OBO was toggled or workspace settings changed, reโcreate MAS so it picks up auth and routing changes; also click Update Agent (or update the endpoint) to pull recent orchestration fixes that eliminated parallelโcall 500s.
Verify PayGo models are permitted if a subโagent relies on firstโparty Claude/OpenAI endpoints; otherwise replace with allowed models or enable PayGo in the workspace.
Confirm Agent Framework OBO is enabled and the MAS was reโcreated after enabling it; retest.
Validate all subโagents are supported (KA endpoints, Genie rooms, UC functions, MCP servers) and end user permissions are set (CAN QUERY/Share/EXECUTE/USE CONNECTION).
Update the MAS endpoint (Configure tab โ Update Agent) and retest to pick up the fix for parallel toolโcalling 500s.
Review Gateway rate limits and disable limits temporarily to rule out throttling; then reโapply with safe headroom.
Keep MAS and subโagent request payloads under 4 MB and design Genie steps to summarize large outputs before returning to MAS.
Pull servedโmodel logs and build logs via the REST calls above; also enable inference tables and realโtime MLflow tracing for deeper RCA.
Hope this helps you get to a sound resolution.
Cheers, Louis.
โ11-08-2025 02:00 PM
Hi @tsukitsune , thanks for the detailed contextโhereโs a concise set of causes, diagnostics, and workarounds to get your multi-agent supervisor stable.
Missing or misconfigured Agent Framework OnโBehalfโOf (OBO) Authorization. MAS invokes subโagents with the callerโs permissions; OBO must be enabled and the MAS reโcreated after toggling it.
Subโagent uses a disabled payโasโyouโgo (PayGo) model (e.g., Claude) or a model thatโs not allowed in the workspace; MAS logs show PERMISSION_DENIED/Model disabled and bubble up as 500.
Intermittent infra issues or a prior MAS bug around parallel tool calls; a fix was shippedโupdating the endpoint resolved repeated 500s in multiple workspaces.
Rate limiting can surface as 500 in some paths; ensure AI Gateway rate limits arenโt being hit by MAS traffic.
Serverless compute dependency missing in the workspace (MAS relies on serverless model serving).
Payload/response size or execution limits exceeded during orchestration (e.g., Genie returning large intermediate results). For agents, request payload limit is 4 MB, and responses >1 MB arenโt logged; max execution time per request is 297s.
Using unsupported subโagent types. MAS currently supports Agent Bricks: Knowledge Assistant endpoints (plus Genie, UC functions, and MCP servers). Custom code agents not created via Knowledge Assistant are not supported as โAgent Endpointโ in the MAS UI.
Pull model server logs for the served MAS entity via REST; these show runtime errors that lead to 500s:
# Served model logs
curl -H "Authorization: Bearer $TOKEN" \
"https://<workspace-host>/api/2.0/serving-endpoints/<mas-endpoint>/served-models/<served-model-name>/logs?config_version=1"
And container build logs:
curl -H "Authorization: Bearer $TOKEN" \
"https://<workspace-host>/api/2.0/serving-endpoints/<mas-endpoint>/served-models/<served-model-name>/build-logs?config_version=1"
Enable AI Gateway inference tables on the MAS endpoint; these log request/response payloads and MLflow traces for agents. Note: logging is bestโeffort and may not populate for 500s; payloads >1 MiB wonโt be logged.
Use MLflow 3 realโtime tracing for agent observability; MAS and subโagents log traces to an experiment and optionally to Delta tables for production monitoring.
Check endpoint health metrics (latency, error rate, QPS) and service logs in the Serving UI for runtime behavior and failures.
MAS supports up to 10 agents/tools; ensure each end user has explicit access to every subโagent (CAN QUERY for KA, Share for Genie, EXECUTE for UC functions, USE CONNECTION for MCP).
Knowledge Assistant embedding endpoint (databricksโgteโlargeโen) must have AI Guardrails and rate limits disabled for ingestion; confirm this in Gateway settings.
MAS was not designed to pass large dataframes between Genie spaces; it routes and consolidates answers. If your Genie agent produces large intermediate data (e.g., 5000ร22 rows), downโsample/summarize inโagent, or narrow the query so MAS handles smaller responses.
If OBO was toggled or workspace settings changed, reโcreate MAS so it picks up auth and routing changes; also click Update Agent (or update the endpoint) to pull recent orchestration fixes that eliminated parallelโcall 500s.
Verify PayGo models are permitted if a subโagent relies on firstโparty Claude/OpenAI endpoints; otherwise replace with allowed models or enable PayGo in the workspace.
Confirm Agent Framework OBO is enabled and the MAS was reโcreated after enabling it; retest.
Validate all subโagents are supported (KA endpoints, Genie rooms, UC functions, MCP servers) and end user permissions are set (CAN QUERY/Share/EXECUTE/USE CONNECTION).
Update the MAS endpoint (Configure tab โ Update Agent) and retest to pick up the fix for parallel toolโcalling 500s.
Review Gateway rate limits and disable limits temporarily to rule out throttling; then reโapply with safe headroom.
Keep MAS and subโagent request payloads under 4 MB and design Genie steps to summarize large outputs before returning to MAS.
Pull servedโmodel logs and build logs via the REST calls above; also enable inference tables and realโtime MLflow tracing for deeper RCA.
Hope this helps you get to a sound resolution.
Cheers, Louis.
โ11-15-2025 03:56 AM
Thanks @Louis_Frolio for the detailed response! The first tip on turning on the Agent Framework OnโBehalfโOf (OBO) Authorization resolved the issue. Cheers mate!
โ11-15-2025 11:35 AM
Glad you found a resolution! Cheers, Louis.
โ11-17-2025 03:00 AM
Hello, i'm facing the same issue while testing sample queries in the "Test your Agent" box.
Could anyone plese help me with the process of enabling OBO authorization
โ11-17-2025 03:36 AM
Update: Enabled OBO authorization but it still doesn't seem to resolve the issue. Also cross checked compute and other requirements.
โ11-17-2025 04:37 AM
@shivamrai162 , Did you recreate the agent after enabling the preview?
โ11-17-2025 08:14 PM
Thanks Kaushal, I tried recreating it again and its working now.
โ11-17-2025 08:42 PM
Good to know it's working now @shivamrai162