Agent Bricks - MAS 500 Internal error

tsukitsune
New Contributor

Hi Databricks Team / Community,

I’m encountering a 500 Internal Server Error when calling an Agent Bricks MAS endpoint in my workspace. The error message is:

500 Internal Error. Please try again later. If this issue persists, please contact Databricks support.

Context:

  • I have deployed a multi-agent supervisor using Agent Bricks and exposed it as a serving endpoint.
  • I tried with one to three agents, and all configurations give the same error. Testing each agent endpoint separately works fine.

Troubleshooting I’ve Tried:

  • Verified workspace permissions; the token/user has access to all referenced models and tools.
  • Checked cluster status; compute resources appear healthy.
  • Re-deployed the endpoint to ensure the latest agent version is active.
  • Tested with smaller payloads.

I would appreciate guidance on:

  1. What could cause a 500 Internal Error in Agent Bricks endpoints?
  2. How to reliably debug or capture detailed logs for such failures.
  3. Any known limitations or workarounds for multi-agent endpoints causing 500 errors.

Thank you in advance for any help or insights!

1 REPLY

Louis_Frolio
Databricks Employee

Hi @tsukitsune, thanks for the detailed context. Here's a concise set of causes, diagnostics, and workarounds to get your multi-agent supervisor stable.

Likely root causes of 500 on a Multi‑Agent Supervisor (MAS)

  • Missing or misconfigured Agent Framework On‑Behalf‑Of (OBO) Authorization. MAS invokes sub‑agents with the caller’s permissions; OBO must be enabled and the MAS re‑created after toggling it.

  • A sub-agent uses a disabled pay-as-you-go (PayGo) model (e.g., Claude) or a model that isn't allowed in the workspace; the MAS logs show PERMISSION_DENIED / model disabled, which bubbles up as a 500 (a quick isolation test is sketched after this list).

  • Intermittent infra issues or a prior MAS bug around parallel tool calls; a fix was shipped—updating the endpoint resolved repeated 500s in multiple workspaces.

  • Rate limiting can surface as 500 in some paths; ensure AI Gateway rate limits aren’t being hit by MAS traffic.

  • Serverless compute dependency missing in the workspace (MAS relies on serverless model serving).

  • Payload/response size or execution limits exceeded during orchestration (e.g., Genie returning large intermediate results). For agents, the request payload limit is 4 MB, responses larger than 1 MB aren't logged, and the maximum execution time per request is 297 seconds.

  • Using unsupported sub-agent types. MAS currently supports Agent Bricks: Knowledge Assistant endpoints (plus Genie, UC functions, and MCP servers). Custom code agents not created via Knowledge Assistant are not supported as "Agent Endpoint" in the MAS UI.
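
If it helps to isolate which sub-agent is failing, here's a minimal sketch (Python, using requests) that calls each sub-agent serving endpoint directly with the same token the MAS caller uses, so a PERMISSION_DENIED or "model disabled" response surfaces instead of the generic 500. The host, endpoint names, and the chat-style payload are placeholders; adjust them to your workspace and agent signatures.

    import os
    import requests

    HOST = "https://<workspace-host>"           # your workspace URL
    TOKEN = os.environ["DATABRICKS_TOKEN"]      # same identity the MAS runs on behalf of
    SUB_AGENTS = ["ka-endpoint-1", "genie-agent-endpoint"]  # placeholder endpoint names

    payload = {"messages": [{"role": "user", "content": "ping"}]}  # small chat-style probe

    for name in SUB_AGENTS:
        resp = requests.post(
            f"{HOST}/serving-endpoints/{name}/invocations",
            headers={"Authorization": f"Bearer {TOKEN}"},
            json=payload,
            timeout=120,
        )
        # A PERMISSION_DENIED or "model disabled" body here points at that sub-agent,
        # even though the MAS itself only reported a generic 500.
        print(name, resp.status_code, resp.text[:500])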

How to capture detailed logs and debug reliably

  • Pull model server logs for the served MAS entity via REST; these show runtime errors that lead to 500s:

    # Served model logs
    curl -H "Authorization: Bearer $TOKEN" \
      "https://<workspace-host>/api/2.0/serving-endpoints/<mas-endpoint>/served-models/<served-model-name>/logs?config_version=1"

    And container build logs:

    curl -H "Authorization: Bearer $TOKEN" \
      "https://<workspace-host>/api/2.0/serving-endpoints/<mas-endpoint>/served-models/<served-model-name>/build-logs?config_version=1"
  • Enable AI Gateway inference tables on the MAS endpoint; these log request/response payloads and MLflow traces for agents. Note: logging is best‑effort and may not populate for 500s; payloads >1 MiB won’t be logged.

  • Use MLflow 3 real-time tracing for agent observability; MAS and sub-agents log traces to an experiment and optionally to Delta tables for production monitoring (a query sketch covering inference tables and traces follows this list).

  • Check endpoint health metrics (latency, error rate, QPS) and service logs in the Serving UI for runtime behavior and failures.
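
Once inference tables and tracing are on, a quick way to pull the failing requests is a notebook query like the sketch below. The catalog/schema/table and experiment names are placeholders, and the payload-table column names follow the documented schema but can vary, so verify them against your own table; this assumes a Databricks notebook where spark and display are available.

    import mlflow

    # 1) Failed requests captured in the endpoint's payload/inference table (logging is best-effort).
    failed = spark.sql("""
        SELECT databricks_request_id, status_code, request, response
        FROM main.agent_logs.mas_endpoint_payload   -- placeholder table name
        WHERE status_code >= 500
        ORDER BY timestamp_ms DESC
        LIMIT 20
    """)
    display(failed)

    # 2) Real-time traces logged by the MAS and its sub-agents (MLflow 3 tracing).
    traces = mlflow.search_traces(
        experiment_ids=["<mas-experiment-id>"],     # placeholder experiment id
        max_results=50,
    )
    display(traces)   # inspect error traces; exact status column naming depends on MLflow version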

Known limitations and recommended workarounds

  • MAS supports up to 10 agents/tools; ensure each end user has explicit access to every sub-agent (CAN QUERY for KA endpoints, Share for Genie, EXECUTE for UC functions, USE CONNECTION for MCP). A grant sketch follows this list.

  • Knowledge Assistant embedding endpoint (databricks‑gte‑large‑en) must have AI Guardrails and rate limits disabled for ingestion; confirm this in Gateway settings.

  • MAS was not designed to pass large dataframes between Genie spaces; it routes and consolidates answers. If your Genie agent produces large intermediate data (e.g., 5000×22 rows), down-sample/summarize in-agent, or narrow the query so MAS handles smaller responses.

  • If OBO was toggled or workspace settings changed, re‑create MAS so it picks up auth and routing changes; also click Update Agent (or update the endpoint) to pull recent orchestration fixes that eliminated parallel‑call 500s.

  • Verify PayGo models are permitted if a sub‑agent relies on first‑party Claude/OpenAI endpoints; otherwise replace with allowed models or enable PayGo in the workspace.
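
On the permissions point above, here is a minimal sketch of the Unity Catalog grants for tool-function and MCP-connection access (the principal, function, and connection names are placeholders; CAN QUERY on the sub-agent serving endpoints and Genie space sharing are still done in their respective UIs or via the permissions API).

    principal = "`analysts`"   # placeholder group/user the MAS serves on behalf of

    # A UC function used as a tool needs EXECUTE, plus USE CATALOG / USE SCHEMA on its parents.
    spark.sql(f"GRANT USE CATALOG ON CATALOG main TO {principal}")
    spark.sql(f"GRANT USE SCHEMA ON SCHEMA main.tools TO {principal}")
    spark.sql(f"GRANT EXECUTE ON FUNCTION main.tools.lookup_orders TO {principal}")

    # An MCP server reached through a UC connection needs USE CONNECTION.
    spark.sql(f"GRANT USE CONNECTION ON CONNECTION my_mcp_connection TO {principal}")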

Fast checklist to isolate your case

  • Confirm Agent Framework OBO is enabled and the MAS was re‑created after enabling it; retest.

  • Validate all sub‑agents are supported (KA endpoints, Genie rooms, UC functions, MCP servers) and end user permissions are set (CAN QUERY/Share/EXECUTE/USE CONNECTION).

  • Update the MAS endpoint (Configure tab → Update Agent) and retest to pick up the fix for parallel tool‑calling 500s.

  • Review Gateway rate limits and disable limits temporarily to rule out throttling; then re‑apply with safe headroom.

  • Keep MAS and sub‑agent request payloads under 4 MB and design Genie steps to summarize large outputs before returning to MAS.

  • Pull served-model logs and build logs via the REST calls above; also enable inference tables and real-time MLflow tracing for deeper RCA. A small smoke test is sketched below.
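
After working through the checklist, a small smoke test like the sketch below (host and endpoint name are placeholders) keeps the request well under the 4 MB limit and prints the full error body, which can point at the failing sub-agent or permission rather than just the bare 500.

    import json
    import os
    import requests

    HOST = "https://<workspace-host>"
    TOKEN = os.environ["DATABRICKS_TOKEN"]
    MAS_ENDPOINT = "<mas-endpoint>"

    payload = {"messages": [{"role": "user", "content": "Which data sources can you query?"}]}
    body = json.dumps(payload)
    assert len(body.encode()) < 4 * 1024 * 1024, "keep agent requests under the 4 MB limit"

    resp = requests.post(
        f"{HOST}/serving-endpoints/{MAS_ENDPOINT}/invocations",
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
        data=body,
        timeout=300,   # server-side execution cap is ~297s, so allow the full window
    )
    print(resp.status_code)
    print(resp.text[:2000])   # the error body often identifies the failing sub-agent or permission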


Hope this helps you get to a sound resolution.

Cheers, Louis.
