<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Petition to update your documentation about inferencing LLMs and standardize LLM response format in Generative AI</title>
    <link>https://community.databricks.com/t5/generative-ai/petition-to-update-your-documentation-about-inferencing-llms-and/m-p/154065#M1745</link>
    <description>&lt;P&gt;Hey &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/201642"&gt;@trickywhitecat&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;You're right that when thinking.type is enabled for Gemini 2.5 Flash on a Databricks serving endpoint, the content field comes back as a list of dictionaries instead of a plain string. That breaks the expected OpenAI ChatCompletion schema (content: Optional[str]), and anything relying on strict type checking, including MLflow, will trip over it.&lt;/P&gt;
&lt;P&gt;A few things worth calling out:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;
&lt;P&gt;This is specific to reasoning/thinking mode. With thinking disabled, content comes back as a string as expected.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;The thinking parameter config for Gemini models is underdocumented. Right now you have to reverse-engineer the payload structure from the Claude example. There should be a dedicated Gemini 2.5 example showing both the extra_body configuration and the actual response structure.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Your core point stands: an "OpenAI-compatible" endpoint should return OpenAI-compatible responses regardless of model config. If the content field shape changes based on a toggle, that either needs to be normalized server-side or documented explicitly.&lt;/P&gt;
&lt;/LI&gt;
&lt;/OL&gt;
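&lt;P&gt;For reference, here's a sketch of what the request might look like. The thinking.type value comes from your post; the budget_tokens key is inferred from the Claude example, and the workspace host, token, and endpoint name are placeholders, so treat this as an assumption to verify rather than official guidance:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class="language-python"&gt;from openai import OpenAI

# Assumptions: workspace host, token, and endpoint name are placeholders
client = OpenAI(
    api_key="&amp;lt;your-databricks-token&amp;gt;",
    base_url="https://&amp;lt;your-workspace-host&amp;gt;/serving-endpoints",
)

response = client.chat.completions.create(
    model="&amp;lt;your-gemini-2-5-flash-endpoint&amp;gt;",
    messages=[{"role": "user", "content": "Short testing answer"}],
    # "type": "enabled" is what the original post describes;
    # "budget_tokens" is inferred from the Claude example and may differ
    extra_body={"thinking": {"type": "enabled", "budget_tokens": 1024}},
)
&lt;/CODE&gt;&lt;/PRE&gt;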
&lt;P&gt;For anyone working around this today, here's a simple adapter:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class="language-python"&gt;msg = response.choices[0].message

if isinstance(msg.content, list):
    text_blocks = [
        block.get("text", "")
        for block in msg.content
        if block.get("type") == "text"
    ]
    answer = "".join(text_blocks)
else:
    answer = msg.content or ""
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;That gives you a plain string you can pass into any tooling that expects the standard schema.&lt;/P&gt;
&lt;P&gt;The documentation gaps your post highlights are concrete and actionable:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;A Gemini 2.5-specific example in the reasoning models doc showing the exact extra_body payload and the response content structure when thinking is enabled.&lt;/LI&gt;
&lt;LI&gt;An explicit callout in the OpenAI-compatible sections that reasoning models may return List[Block] instead of str, with a normalization snippet like the one above.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;I'd also encourage you to file this through &lt;STRONG&gt;Databricks Support&lt;/STRONG&gt; if you haven't already. Community posts raise visibility, but a support ticket gets it into the product team's tracking.&lt;/P&gt;
&lt;P&gt;Cheers, Lou&lt;/P&gt;</description>
    <pubDate>Fri, 10 Apr 2026 11:28:49 GMT</pubDate>
    <dc:creator>Louis_Frolio</dc:creator>
    <dc:date>2026-04-10T11:28:49Z</dc:date>
    <item>
      <title>Petition to update your documentation about inferencing LLMs and standardize LLM response format</title>
      <link>https://community.databricks.com/t5/generative-ai/petition-to-update-your-documentation-about-inferencing-llms-and/m-p/142318#M1535</link>
      <description>&lt;DIV class=""&gt;&lt;P&gt;&lt;STRONG&gt;TL;DR&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Inconsistent &amp;amp; Undocumented Response Format:&lt;/STRONG&gt; Contrary to Databricks documentation, enabling thinking.type for Gemini 2.5 Flash changes the ChatCompletion content field from a standard string to a list of dictionaries, breaking expected behavior.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Missing Configuration Documentation:&lt;/STRONG&gt; There is no official documentation on how to configure the thinking parameter; users must infer settings from unrelated (Claude) examples.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Breaks OpenAI Compatibility:&lt;/STRONG&gt; The altered content format violates the OpenAI schema (which expects Optional[str]), causing errors in standard OpenAI Python packages and tools like MLflow that rely on strict type checking.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Unnecessary Regression:&lt;/STRONG&gt; While Google's native Vertex AI and Gemini API provide working OpenAI-compatible endpoints, this specific implementation needlessly breaks compatibility, rendering the "OpenAI-compatible" claim false.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;According to your documentation in&amp;nbsp;&lt;A href="https://docs.databricks.com/aws/en/machine-learning/model-serving/query-reason-models" target="_self"&gt;https://docs.databricks.com/aws/en/machine-learning/model-serving/query-reason-models&lt;/A&gt;&amp;nbsp;, one can expect gemini 2.5 flash to have the same response format as the `&lt;SPAN&gt;Gemini&amp;nbsp;model&amp;nbsp;example`, but in reality, gemini 2.5 flash response format will change depending on whether `thinking.type` is set to `enabled` or not:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;- If thinking is enabled then `content` field of `ChatCompletion` is a list of dict with key `type` and `text`, and the value 
for `type` is always "text":&lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;ChatCompletion(
    id=None,
    choices=[
        Choice(
            finish_reason="stop",
            index=0,
            logprobs=None,
            message=ChatCompletionMessage(
                content=[
                    {
                        "type": "text",
                        "text": "Okay, here's how I'd approach this \"Short testing answer\" request. First, I need to really understand what the user is after. \"Short testing answer\"... that's deliberately vague, isn't it? So, I have to figure out the user's intent, the context, what they *really* need. They're asking for something concise about testing, fine, but what *aspect* of testing? That's the key.\n\nSo, I immediately start breaking down the core principles of testing in my mind. The fundamentals. What *is* testing, really? It's about finding those nasty bugs, those defects that could cripple things later on. But it's more than that; it's about *quality*. Ensuring the system, the application, the product – whatever it is – actually *works* as intended. It's about verifying that all the requirements are met, that the functionality is bulletproof, and that everything is validated.\n\nBeyond that, I know testing is about improving the reliability and performance. No one wants a slow or glitchy system. Prevention is key too: catching issues *before* they cause a major headache for users. It's about building confidence in the product, letting everyone involved know it’s ready to go. And, of course, testing always has a risk assessment element. What are the potential pitfalls? What could go wrong? Testing helps me understand and mitigate those risks. This framework, the core principles of testing, are what I need to draw from to formulate an answer.\n",
                    },
                    {
                        "type": "text",
                        "text": 'Okay, here are a few options for a "short testing answer," depending on what context you need it for:\n\n**Option 1 (General Definition):**\nTesting is the process of evaluating a system or component to determine if it meets specified requirements, identify defects, and ensure quality.\n\n**Option 2 (Purpose-focused):**\nTesting aims to find defects, verify functionality, and build confidence in a product\'s quality and reliability.\n\n**Option 3 (Very concise):**\nTesting ensures quality by finding defects and verifying functionality.\n\n**Option 4 (Benefit-focused):**\nTesting reduces risk and improves user experience by identifying and fixing issues before release.',
                    },
                ],
                refusal=None,
                role="assistant",
                annotations=None,
                audio=None,
                function_call=None,
                tool_calls=None,
            ),
        )
    ],
    created=&amp;lt;masked&amp;gt;,
    model="gemini-2.5-flash",
    object="chat.completion",
    service_tier=None,
    system_fingerprint=None,
    usage=CompletionUsage(
        completion_tokens=140,
        prompt_tokens=4,
        total_tokens=270,
        completion_tokens_details=None,
        prompt_tokens_details=None,
        reasoning_tokens=126,
    ),
)&lt;/LI-CODE&gt;&lt;P&gt;&lt;SPAN&gt;If thinking is disabled, the content field is a string as expected. This behavior is undocumented.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;There isn't even documentation on what to set in the `thinking` extra body; I had to refer to the Claude example to learn how to enable thinking.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P data-unlink="true"&gt;&lt;SPAN&gt;Moreover, you offer an OpenAI-compatible endpoint, but the response is not compatible: the `content` field is expected to be `Optional[str]` according to the &lt;A href="https://github.com/openai/openai-python/blob/main/src/openai/types/chat/chat_completion_message.py" target="_self"&gt;official openai python package&lt;/A&gt;, yet your `content` field varies depending on the model and model settings. This breaks any package that type-checks chat completion messages; even your hosted MLflow raises a warning about the unexpected response format. The point of an "OpenAI-compatible" endpoint is that it can be used pretty much everywhere, and you're making it usable nowhere. For Gemini models specifically, Google already provides an OpenAI-compatible endpoint in both Vertex AI and the Gemini API, yet you went out of your way to break compatibility. Why?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 22 Dec 2025 01:26:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/petition-to-update-your-documentation-about-inferencing-llms-and/m-p/142318#M1535</guid>
      <dc:creator>trickywhitecat</dc:creator>
      <dc:date>2025-12-22T01:26:58Z</dc:date>
    </item>
    <item>
      <title>Re: Petition to update your documentation about inferencing LLMs and standardize LLM response format</title>
      <link>https://community.databricks.com/t5/generative-ai/petition-to-update-your-documentation-about-inferencing-llms-and/m-p/154065#M1745</link>
      <description>&lt;P&gt;Hey &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/201642"&gt;@trickywhitecat&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;
&lt;P&gt;You're right that when thinking.type is enabled for Gemini 2.5 Flash on a Databricks serving endpoint, the content field comes back as a list of dictionaries instead of a plain string. That breaks the expected OpenAI ChatCompletion schema (content: Optional[str]), and anything relying on strict type checking, including MLflow, will trip over it.&lt;/P&gt;
&lt;P&gt;A few things worth calling out:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;
&lt;P&gt;This is specific to reasoning/thinking mode. With thinking disabled, content comes back as a string as expected.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;The thinking parameter config for Gemini models is underdocumented. Right now you have to reverse-engineer the payload structure from the Claude example. There should be a dedicated Gemini 2.5 example showing both the extra_body configuration and the actual response structure.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P&gt;Your core point stands: an "OpenAI-compatible" endpoint should return OpenAI-compatible responses regardless of model config. If the content field shape changes based on a toggle, that either needs to be normalized server-side or documented explicitly.&lt;/P&gt;
&lt;/LI&gt;
&lt;/OL&gt;
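&lt;P&gt;For reference, here's a sketch of what the request might look like. The thinking.type value comes from your post; the budget_tokens key is inferred from the Claude example, and the workspace host, token, and endpoint name are placeholders, so treat this as an assumption to verify rather than official guidance:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class="language-python"&gt;from openai import OpenAI

# Assumptions: workspace host, token, and endpoint name are placeholders
client = OpenAI(
    api_key="&amp;lt;your-databricks-token&amp;gt;",
    base_url="https://&amp;lt;your-workspace-host&amp;gt;/serving-endpoints",
)

response = client.chat.completions.create(
    model="&amp;lt;your-gemini-2-5-flash-endpoint&amp;gt;",
    messages=[{"role": "user", "content": "Short testing answer"}],
    # "type": "enabled" is what the original post describes;
    # "budget_tokens" is inferred from the Claude example and may differ
    extra_body={"thinking": {"type": "enabled", "budget_tokens": 1024}},
)
&lt;/CODE&gt;&lt;/PRE&gt;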
&lt;P&gt;For anyone working around this today, here's a simple adapter:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class="language-python"&gt;msg = response.choices[0].message

if isinstance(msg.content, list):
    text_blocks = [
        block.get("text", "")
        for block in msg.content
        if block.get("type") == "text"
    ]
    answer = "".join(text_blocks)
else:
    answer = msg.content or ""
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;That gives you a plain string you can pass into any tooling that expects the standard schema.&lt;/P&gt;
&lt;P&gt;The documentation gaps your post highlights are concrete and actionable:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;A Gemini 2.5-specific example in the reasoning models doc showing the exact extra_body payload and the response content structure when thinking is enabled.&lt;/LI&gt;
&lt;LI&gt;An explicit callout in the OpenAI-compatible sections that reasoning models may return List[Block] instead of str, with a normalization snippet like the one above.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;I'd also encourage you to file this through &lt;STRONG&gt;Databricks Support&lt;/STRONG&gt; if you haven't already. Community posts raise visibility, but a support ticket gets it into the product team's tracking.&lt;/P&gt;
&lt;P&gt;Cheers, Lou&lt;/P&gt;</description>
      <pubDate>Fri, 10 Apr 2026 11:28:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/generative-ai/petition-to-update-your-documentation-about-inferencing-llms-and/m-p/154065#M1745</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2026-04-10T11:28:49Z</dc:date>
    </item>
  </channel>
</rss>

