Generative AI
Explore discussions on generative artificial intelligence techniques and applications within the Databricks Community. Share ideas, challenges, and breakthroughs in this cutting-edge field.

Agent Bricks - MAS 500 Internal error

tsukitsune
New Contributor

Hi Databricks Team / Community,

I’m encountering a 500 Internal Server Error when calling an Agent Bricks MAS endpoint in my workspace. The error message is:

500 Internal Error. Please try again later. If this issue persists, please contact Databricks support.

Context:

  • I have deployed a multi-agent supervisor using Agent Bricks and exposed it as a serving endpoint.
  • I tried with one to three agents, and every configuration produces the same error. Testing the agent endpoints individually works fine.

Troubleshooting I’ve Tried:

  • Verified workspace permissions; the token/user has access to all referenced models and tools.
  • Checked cluster status; compute resources appear healthy.
  • Re-deployed the endpoint to ensure the latest agent version is active.
  • Tested with smaller payloads.

I would appreciate guidance on:

  1. What could cause a 500 Internal Error in Agent Bricks endpoints?
  2. How to reliably debug or capture detailed logs for such failures.
  3. Any known limitations or workarounds for multi-agent endpoints causing 500 errors.

Thank you in advance for any help or insights!

1 REPLY

Louis_Frolio
Databricks Employee

Hi @tsukitsune ,  thanks for the detailed context—here’s a concise set of causes, diagnostics, and workarounds to get your multi-agent supervisor stable.

Likely root causes of 500 on a Multi‑Agent Supervisor (MAS)

  • Missing or misconfigured Agent Framework On‑Behalf‑Of (OBO) Authorization. MAS invokes sub‑agents with the caller’s permissions; OBO must be enabled and the MAS re‑created after toggling it.

  • A sub‑agent uses a disabled pay‑as‑you‑go (PayGo) model (e.g., Claude) or a model that isn’t allowed in the workspace; the MAS logs show PERMISSION_DENIED/Model disabled, which bubbles up as a 500.

  • Intermittent infra issues or a prior MAS bug around parallel tool calls; a fix was shipped—updating the endpoint resolved repeated 500s in multiple workspaces.

  • Rate limiting can surface as 500 in some paths; ensure AI Gateway rate limits aren’t being hit by MAS traffic.

  • Serverless compute dependency missing in the workspace (MAS relies on serverless model serving).

  • Payload/response size or execution limits exceeded during orchestration (e.g., Genie returning large intermediate results). For agents, the request payload limit is 4 MB, responses larger than 1 MB aren’t logged, and the maximum execution time per request is 297 s.

  • Using unsupported sub‑agent types. MAS currently supports Agent Bricks: Knowledge Assistant endpoints (plus Genie, UC functions, and MCP servers). Custom code agents not created via Knowledge Assistant are not supported as “Agent Endpoint” in the MAS UI.
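Because the payload limit is an easy cause to miss, a quick client-side pre-flight check can rule it out before you ever hit the endpoint. This is just a sketch; the 4 MB figure is the documented request limit, and the payload shown is a placeholder:

```python
import json

MAX_REQUEST_BYTES = 4 * 1024 * 1024  # documented 4 MB request limit for agent endpoints

def check_payload_size(payload: dict) -> int:
    """Serialize the request body and fail fast if it exceeds the
    4 MB limit, instead of getting an opaque server-side error."""
    body = json.dumps(payload).encode("utf-8")
    if len(body) > MAX_REQUEST_BYTES:
        raise ValueError(
            f"Payload is {len(body)} bytes, over the {MAX_REQUEST_BYTES}-byte limit"
        )
    return len(body)

# A typical MAS chat request is tiny, but a large Genie result echoed
# back into the message history can blow past the limit quickly.
size = check_payload_size({"messages": [{"role": "user", "content": "hello"}]})
print(size)
```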

How to capture detailed logs and debug reliably

  • Pull model server logs for the served MAS entity via REST; these show runtime errors that lead to 500s:

    # Served model logs
    curl -H "Authorization: Bearer $TOKEN" \
      "https://<workspace-host>/api/2.0/serving-endpoints/<mas-endpoint>/served-models/<served-model-name>/logs?config_version=1"

    And container build logs:

    curl -H "Authorization: Bearer $TOKEN" \
      "https://<workspace-host>/api/2.0/serving-endpoints/<mas-endpoint>/served-models/<served-model-name>/build-logs?config_version=1"
  • Enable AI Gateway inference tables on the MAS endpoint; these log request/response payloads and MLflow traces for agents. Note: logging is best‑effort and may not populate for 500s; payloads >1 MiB won’t be logged.

  • Use MLflow 3 real‑time tracing for agent observability; MAS and sub‑agents log traces to an experiment and optionally to Delta tables for production monitoring.

  • Check endpoint health metrics (latency, error rate, QPS) and service logs in the Serving UI for runtime behavior and failures.
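If you are pulling logs repeatedly while debugging, the two curl calls above can be wrapped in a small helper. This is only a convenience sketch mirroring those same REST paths; the host, endpoint, and served-model names are placeholders:

```python
import urllib.request

def served_model_logs_url(host: str, endpoint: str, served_model: str,
                          build: bool = False, config_version: int = 1) -> str:
    """Build the serving-endpoints logs (or build-logs) URL used in
    the curl examples above."""
    kind = "build-logs" if build else "logs"
    return (f"https://{host}/api/2.0/serving-endpoints/{endpoint}"
            f"/served-models/{served_model}/{kind}?config_version={config_version}")

def fetch_logs(host: str, endpoint: str, served_model: str, token: str) -> str:
    """Fetch the logs (requires network access to the workspace)."""
    req = urllib.request.Request(
        served_model_logs_url(host, endpoint, served_model),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

print(served_model_logs_url("example.cloud.databricks.com", "mas-endpoint", "mas-model"))
```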

Known limitations and recommended workarounds

  • MAS supports up to 10 agents/tools; ensure each end user has explicit access to every sub‑agent (CAN QUERY for KA, Share for Genie, EXECUTE for UC functions, USE CONNECTION for MCP).

  • Knowledge Assistant embedding endpoint (databricks‑gte‑large‑en) must have AI Guardrails and rate limits disabled for ingestion; confirm this in Gateway settings.

  • MAS was not designed to pass large dataframes between Genie spaces; it routes and consolidates answers. If your Genie agent produces large intermediate data (e.g., 5000×22 rows), down‑sample/summarize in‑agent, or narrow the query so MAS handles smaller responses.

  • If OBO was toggled or workspace settings changed, re‑create MAS so it picks up auth and routing changes; also click Update Agent (or update the endpoint) to pull recent orchestration fixes that eliminated parallel‑call 500s.

  • Verify PayGo models are permitted if a sub‑agent relies on first‑party Claude/OpenAI endpoints; otherwise replace with allowed models or enable PayGo in the workspace.
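For the intermittent-infrastructure class of 500s, a bounded client-side retry with exponential backoff is a reasonable stopgap while an endpoint update rolls out. A minimal sketch; `call` stands in for however your client invokes the MAS endpoint, and the status-extraction logic should be adapted to your HTTP library:

```python
import time

def call_with_retry(call, max_attempts=4, base_delay=1.0):
    """Retry a callable on transient HTTP 500s with exponential backoff.

    Assumes `call` raises an exception carrying a `status` attribute on
    HTTP errors; any other failure (or exhausted attempts) is re-raised.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "status", None)
            if status != 500 or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Keep the attempt count small: if the 500 is caused by a configuration problem (OBO, permissions, disabled models) rather than transient infrastructure, retrying will never succeed.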

Fast checklist to isolate your case

  • Confirm Agent Framework OBO is enabled and the MAS was re‑created after enabling it; retest.

  • Validate all sub‑agents are supported (KA endpoints, Genie rooms, UC functions, MCP servers) and end user permissions are set (CAN QUERY/Share/EXECUTE/USE CONNECTION).

  • Update the MAS endpoint (Configure tab → Update Agent) and retest to pick up the fix for parallel tool‑calling 500s.

  • Review Gateway rate limits and disable limits temporarily to rule out throttling; then re‑apply with safe headroom.

  • Keep MAS and sub‑agent request payloads under 4 MB and design Genie steps to summarize large outputs before returning to MAS.

  • Pull served‑model logs and build logs via the REST calls above; also enable inference tables and real‑time MLflow tracing for deeper RCA.


Hope this helps you get to a sound resolution.

Cheers, Louis.