Generative AI
Explore discussions on generative artificial intelligence techniques and applications within the Databricks Community. Share ideas, challenges, and breakthroughs in this cutting-edge field.

FMAPI Anthropic endpoint rejects requests with trailing assistant message — known limitation?

cormierjohn
New Contributor

Hey all — looking for confirmation on a behavior I'm hitting on the Foundation Model API (pay-per-token) Anthropic-compatible endpoint, in case anyone else has worked around it.

What I'm doing: serving Claude models through /serving-endpoints/anthropic/v1/messages on the FMAPI pay-per-token tier. AAD bearer auth, U2M flow.

What fails: any request where the messages array ends with a turn of role: "assistant". The endpoint returns:

HTTP 400 BAD_REQUEST
{
  "error_code": "BAD_REQUEST",
  "message": "This model does not support assistant message prefill. The conversation must end with a user message."
}

Minimal repro shape:

{
  "model": "databricks-claude-opus-4-7",
  "max_tokens": 256,
  "messages": [
    {"role": "user", "content": "Complete the sentence:"},
    {"role": "assistant", "content": "The capital of France is "}
  ]
}

Native Anthropic accepts this — it's the documented "assistant prefill" pattern where the model continues from where the partial assistant text leaves off. Common uses: forcing output formats, resuming after interruption, certain tool-loop continuations.

Why this is broader than one client: prefill is foundational in the Anthropic ecosystem. The Anthropic Python/TypeScript SDKs, LangChain's Anthropic provider, AutoGen, and most agent frameworks built on the Anthropic API treat it as a primitive. Any request routed to the FMAPI Anthropic endpoint that uses prefill gets a 400.

What I'm doing today: running a small proxy in front of FMAPI that strips trailing assistant messages before forwarding. Works for cases where prefill is incidental, but silently degrades any client that actually relies on prefill semantics (output-shaping flows especially).
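For anyone wanting to try the same bridge, here is a minimal sketch of just the stripping step. The helper name `strip_trailing_assistant` is made up for illustration, and auth, forwarding, and streaming are left out entirely, so treat it as a starting point rather than a description of the actual proxy.

```python
from copy import deepcopy
from typing import Any


def strip_trailing_assistant(
    payload: dict[str, Any],
) -> tuple[dict[str, Any], list[dict[str, Any]]]:
    """Return (payload safe to forward, dropped turns) without mutating the input.

    Hypothetical helper: removes every assistant turn at the end of
    `messages` so the conversation ends with a non-assistant message.
    """
    cleaned = deepcopy(payload)
    messages = cleaned.get("messages", [])
    dropped: list[dict[str, Any]] = []
    # Pop from the end until the last message is no longer an assistant turn.
    while messages and messages[-1].get("role") == "assistant":
        dropped.append(messages.pop())
    dropped.reverse()  # restore original conversation order
    cleaned["messages"] = messages
    return cleaned, dropped
```

Returning the dropped turns alongside the cleaned payload keeps the lossy step visible to the caller, which matters because the prefill text is discarded, not preserved.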

Questions:

  1. Is this a known/documented limitation of the FMAPI Anthropic endpoint?
  2. Is parity with native Anthropic on this feature planned?
  3. Has anyone found an official workaround other than client-side rewriting?

Thanks!

Accepted Solutions

stbjelcevic
Databricks Employee

Hi @cormierjohn ,

Your understanding is correct. The validation rejecting a trailing assistant turn is happening at the FMAPI proxy layer before the request reaches Claude, so any client that uses Anthropic's prefill primitive will 400 against this endpoint today. Quick pass on your three questions:

  1. Known limitation? Yes. It isn't called out as a feature gap in the FMAPI docs that I can point to, but the error string is purpose-built rather than incidental, so it's an intentional constraint of the current Anthropic-compatible surface, not a transient bug.
  2. Parity planned? Nothing I can share publicly on roadmap. If you want it tracked, the most reliable path is to file a feature request through your Databricks account team or via support so it lands in the FMAPI team's intake with a customer-attached use case.
  3. Workarounds beyond client-side rewriting? A few that may cover specific use cases:
    • Reframe prefill as a user instruction. Move the partial assistant text into the final user turn ("Continue from exactly: 'The capital of France is '"). Imperfect, but preserves FMAPI routing for incidental prefill.
    • Use stop_sequences + post-processing for output-shaping cases where prefill was only being used to constrain format.
    • Route prefill-dependent traffic to Anthropic directly for the specific flows that genuinely need prefill semantics (tool-loop continuations, strict structured output), and keep the rest on FMAPI for governance/billing. Two lanes are uglier than one, but it's the only path today that preserves prefill behavior exactly.
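A rough sketch of the first option (reframing prefill as a user instruction) might look like the following. The function name `reframe_prefill` is hypothetical, the "Continue from exactly:" wording is only the example phrasing from above, and the sketch assumes the prefill itself is a plain string.

```python
import json
from typing import Any


def reframe_prefill(payload: dict[str, Any]) -> dict[str, Any]:
    """Move a trailing assistant prefill into the final user turn.

    Hypothetical helper; returns a new payload and leaves the input untouched.
    """
    messages = [dict(m) for m in payload.get("messages", [])]
    if not messages or messages[-1].get("role") != "assistant":
        return {**payload, "messages": messages}  # nothing to reframe

    prefill = messages.pop()["content"]
    if not isinstance(prefill, str):
        raise ValueError("this sketch only handles string prefill content")

    # json.dumps quotes the text so trailing whitespace stays visible.
    instruction = f"Continue from exactly: {json.dumps(prefill, ensure_ascii=False)}"

    if messages and messages[-1].get("role") == "user":
        content = messages[-1]["content"]
        if isinstance(content, str):
            messages[-1]["content"] = f"{content}\n\n{instruction}"
        else:
            # Content-block form: append a text block instead of concatenating.
            messages[-1]["content"] = list(content) + [
                {"type": "text", "text": instruction}
            ]
    else:
        messages.append({"role": "user", "content": instruction})

    return {**payload, "messages": messages}
```

With the repro payload from the question, this yields a single user message ending in `Continue from exactly: "The capital of France is "`. The model may restate the prefix or ignore the instruction, so downstream parsing should not assume the response begins mid-sentence the way a true prefill continuation would.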

Your stripping proxy is a reasonable bridge for the incidental cases. If you go that route, I'd log every time a trailing assistant turn gets dropped so you can quantify which clients are silently degraded and decide which ones move to the second lane.
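As a sketch of that logging, assuming the proxy can attach some client identifier to each request (the `client_id` argument, the logger name, and `record_dropped_prefill` are all illustrative, not part of any Databricks API):

```python
import logging
from collections import Counter

logger = logging.getLogger("fmapi_proxy.prefill")

# Running tally of dropped assistant turns per client (in-memory only).
dropped_by_client = Counter()


def record_dropped_prefill(client_id: str, messages: list[dict]) -> int:
    """Count trailing assistant turns, log them, and return the count.

    Hypothetical helper: call it just before the proxy strips the turns.
    """
    count = 0
    for message in reversed(messages):
        if message.get("role") != "assistant":
            break
        count += 1

    if count:
        dropped_by_client[client_id] += count
        logger.warning(
            "dropping %d trailing assistant turn(s) for client=%s",
            count,
            client_id,
        )
    return count
```

`dropped_by_client.most_common()` then gives a quick ranking of which clients lean on prefill most. In a real deployment the counter would probably live in a metrics system instead of process memory.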
