Agent Bricks - MAS 500 Internal error

tsukitsune
New Contributor

Hi Databricks Team / Community,

I’m encountering a 500 Internal Server Error when calling an Agent Bricks MAS endpoint in my workspace. The error message is:

500 Internal Error. Please try again later. If this issue persists, please contact Databricks support.

Context:

  • I have deployed a multi-agent supervisor using Agent Bricks and exposed it as a serving endpoint.
  • I tried with one to three agents, and all configurations give the same error. Testing each agent endpoint separately works fine.

Troubleshooting I’ve Tried:

  • Verified workspace permissions; the token/user has access to all referenced models and tools.
  • Checked cluster status; compute resources appear healthy.
  • Re-deployed the endpoint to ensure the latest agent version is active.
  • Tested with smaller payloads.

I would appreciate guidance on:

  1. What could cause a 500 Internal Error in Agent Bricks endpoints?
  2. How to reliably debug or capture detailed logs for such failures.
  3. Any known limitations or workarounds for multi-agent endpoints causing 500 errors.

Thank you in advance for any help or insights!

1 REPLY

Louis_Frolio
Databricks Employee

Hi @tsukitsune, thanks for the detailed context. Here's a concise set of causes, diagnostics, and workarounds to get your multi-agent supervisor stable.

Likely root causes of 500 on a Multi‑Agent Supervisor (MAS)

  • Missing or misconfigured Agent Framework On‑Behalf‑Of (OBO) Authorization. MAS invokes sub‑agents with the caller’s permissions; OBO must be enabled and the MAS re‑created after toggling it.

  • A sub-agent uses a disabled pay-as-you-go (PayGo) model (e.g., Claude) or a model that isn't allowed in the workspace; the MAS logs show PERMISSION_DENIED / model disabled, which bubbles up as a 500 (a quick isolation test is sketched after this list).

  • Intermittent infra issues or a prior MAS bug around parallel tool calls; a fix was shipped—updating the endpoint resolved repeated 500s in multiple workspaces.

  • Rate limiting can surface as 500 in some paths; ensure AI Gateway rate limits aren’t being hit by MAS traffic.

  • Serverless compute dependency missing in the workspace (MAS relies on serverless model serving).

  • Payload/response size or execution limits exceeded during orchestration (e.g., Genie returning large intermediate results). For agents, the request payload limit is 4 MB, responses larger than 1 MB aren't logged, and the maximum execution time per request is 297 seconds.

  • Using unsupported sub-agent types. MAS currently supports Agent Bricks: Knowledge Assistant endpoints (plus Genie, UC functions, and MCP servers). Custom code agents not created via Knowledge Assistant are not supported as "Agent Endpoint" in the MAS UI.
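
If it helps to isolate which sub-agent is failing, here's a minimal sketch (Python, using requests) that calls each sub-agent serving endpoint directly with the same token the MAS caller uses, so a PERMISSION_DENIED or "model disabled" response surfaces instead of the generic 500. The host, endpoint names, and the chat-style payload are placeholders; adjust them to your workspace and agent signatures.

    import os
    import requests

    HOST = "https://<workspace-host>"           # your workspace URL
    TOKEN = os.environ["DATABRICKS_TOKEN"]      # same identity the MAS runs on behalf of
    SUB_AGENTS = ["ka-endpoint-1", "genie-agent-endpoint"]  # placeholder endpoint names

    payload = {"messages": [{"role": "user", "content": "ping"}]}  # small chat-style probe

    for name in SUB_AGENTS:
        resp = requests.post(
            f"{HOST}/serving-endpoints/{name}/invocations",
            headers={"Authorization": f"Bearer {TOKEN}"},
            json=payload,
            timeout=120,
        )
        # A PERMISSION_DENIED or "model disabled" body here points at that sub-agent,
        # even though the MAS itself only reported a generic 500.
        print(name, resp.status_code, resp.text[:500])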

How to capture detailed logs and debug reliably

  • Pull model server logs for the served MAS entity via REST; these show runtime errors that lead to 500s:

    # Served model logs
    curl -H "Authorization: Bearer $TOKEN" \
      "https://<workspace-host>/api/2.0/serving-endpoints/<mas-endpoint>/served-models/<served-model-name>/logs?config_version=1"

    And container build logs:

    curl -H "Authorization: Bearer $TOKEN" \
      "https://<workspace-host>/api/2.0/serving-endpoints/<mas-endpoint>/served-models/<served-model-name>/build-logs?config_version=1"
  • Enable AI Gateway inference tables on the MAS endpoint; these log request/response payloads and MLflow traces for agents. Note: logging is best‑effort and may not populate for 500s; payloads >1 MiB won’t be logged.

  • Use MLflow 3 real-time tracing for agent observability; MAS and sub-agents log traces to an experiment and optionally to Delta tables for production monitoring (a query sketch covering inference tables and traces follows this list).

  • Check endpoint health metrics (latency, error rate, QPS) and service logs in the Serving UI for runtime behavior and failures.
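
Once inference tables and tracing are on, a quick way to pull the failing requests is a notebook query like the sketch below. The catalog/schema/table and experiment names are placeholders, and the payload-table column names follow the documented schema but can vary, so verify them against your own table; this assumes a Databricks notebook where spark and display are available.

    import mlflow

    # 1) Failed requests captured in the endpoint's payload/inference table (logging is best-effort).
    failed = spark.sql("""
        SELECT databricks_request_id, status_code, request, response
        FROM main.agent_logs.mas_endpoint_payload   -- placeholder table name
        WHERE status_code >= 500
        ORDER BY timestamp_ms DESC
        LIMIT 20
    """)
    display(failed)

    # 2) Real-time traces logged by the MAS and its sub-agents (MLflow 3 tracing).
    traces = mlflow.search_traces(
        experiment_ids=["<mas-experiment-id>"],     # placeholder experiment id
        max_results=50,
    )
    display(traces)   # inspect error traces; exact status column naming depends on MLflow version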

Known limitations and recommended workarounds

  • MAS supports up to 10 agents/tools; ensure each end user has explicit access to every sub-agent (CAN QUERY for KA endpoints, Share for Genie, EXECUTE for UC functions, USE CONNECTION for MCP). A grant sketch follows this list.

  • Knowledge Assistant embedding endpoint (databricks‑gte‑large‑en) must have AI Guardrails and rate limits disabled for ingestion; confirm this in Gateway settings.

  • MAS was not designed to pass large dataframes between Genie spaces; it routes and consolidates answers. If your Genie agent produces large intermediate data (e.g., 5000×22 rows), down-sample/summarize in-agent, or narrow the query so MAS handles smaller responses.

  • If OBO was toggled or workspace settings changed, re‑create MAS so it picks up auth and routing changes; also click Update Agent (or update the endpoint) to pull recent orchestration fixes that eliminated parallel‑call 500s.

  • Verify PayGo models are permitted if a sub‑agent relies on first‑party Claude/OpenAI endpoints; otherwise replace with allowed models or enable PayGo in the workspace.
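
On the permissions point above, here is a minimal sketch of the Unity Catalog grants for tool-function and MCP-connection access (the principal, function, and connection names are placeholders; CAN QUERY on the sub-agent serving endpoints and Genie space sharing are still done in their respective UIs or via the permissions API).

    principal = "`analysts`"   # placeholder group/user the MAS serves on behalf of

    # A UC function used as a tool needs EXECUTE, plus USE CATALOG / USE SCHEMA on its parents.
    spark.sql(f"GRANT USE CATALOG ON CATALOG main TO {principal}")
    spark.sql(f"GRANT USE SCHEMA ON SCHEMA main.tools TO {principal}")
    spark.sql(f"GRANT EXECUTE ON FUNCTION main.tools.lookup_orders TO {principal}")

    # An MCP server reached through a UC connection needs USE CONNECTION.
    spark.sql(f"GRANT USE CONNECTION ON CONNECTION my_mcp_connection TO {principal}")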

Fast checklist to isolate your case

  • Confirm Agent Framework OBO is enabled and the MAS was re‑created after enabling it; retest.

  • Validate all sub‑agents are supported (KA endpoints, Genie rooms, UC functions, MCP servers) and end user permissions are set (CAN QUERY/Share/EXECUTE/USE CONNECTION).

  • Update the MAS endpoint (Configure tab → Update Agent) and retest to pick up the fix for parallel tool‑calling 500s.

  • Review Gateway rate limits and disable limits temporarily to rule out throttling; then re‑apply with safe headroom.

  • Keep MAS and sub‑agent request payloads under 4 MB and design Genie steps to summarize large outputs before returning to MAS.

  • Pull served-model logs and build logs via the REST calls above; also enable inference tables and real-time MLflow tracing for deeper RCA. A small smoke test is sketched below.
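
After working through the checklist, a small smoke test like the sketch below (host and endpoint name are placeholders) keeps the request well under the 4 MB limit and prints the full error body, which can point at the failing sub-agent or permission rather than just the bare 500.

    import json
    import os
    import requests

    HOST = "https://<workspace-host>"
    TOKEN = os.environ["DATABRICKS_TOKEN"]
    MAS_ENDPOINT = "<mas-endpoint>"

    payload = {"messages": [{"role": "user", "content": "Which data sources can you query?"}]}
    body = json.dumps(payload)
    assert len(body.encode()) < 4 * 1024 * 1024, "keep agent requests under the 4 MB limit"

    resp = requests.post(
        f"{HOST}/serving-endpoints/{MAS_ENDPOINT}/invocations",
        headers={"Authorization": f"Bearer {TOKEN}", "Content-Type": "application/json"},
        data=body,
        timeout=300,   # server-side execution cap is ~297s, so allow the full window
    )
    print(resp.status_code)
    print(resp.text[:2000])   # the error body often identifies the failing sub-agent or permission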


Hope this helps you get to a sound resolution.

Cheers, Louis.
