TL;DR
- **Inconsistent & undocumented response format:** Contrary to the Databricks documentation, enabling thinking (`thinking.type: enabled`) for Gemini 2.5 Flash changes the `ChatCompletion` `content` field from a plain string to a list of dictionaries, breaking the documented behavior.
- **Missing configuration documentation:** There is no official documentation on how to configure the `thinking` parameter for Gemini; users have to infer the settings from the unrelated Claude example.
- **Breaks OpenAI compatibility:** The altered `content` format violates the OpenAI schema (which expects `Optional[str]`), causing errors in the official openai Python package and in tools like MLflow that validate response types.
- **Unnecessary regression:** Google's own Vertex AI and Gemini API already provide working OpenAI-compatible endpoints, so this implementation needlessly breaks compatibility and makes the "OpenAI-compatible" claim false.
According to your documentation at https://docs.databricks.com/aws/en/machine-learning/model-serving/query-reason-models, one would expect Gemini 2.5 Flash to follow the same response format as the `Gemini model example` shown there. In reality, the Gemini 2.5 Flash response format changes depending on whether `thinking.type` is set to `enabled`:
- If thinking is enabled, the `content` field of `ChatCompletion` is a list of dicts with the keys `type` and `text`, where the value of `type` is always `"text"`:
```python
ChatCompletion(
    id=None,
    choices=[
        Choice(
            finish_reason="stop",
            index=0,
            logprobs=None,
            message=ChatCompletionMessage(
                content=[
                    {
                        "type": "text",
                        "text": "Okay, here's how I'd approach this \"Short testing answer\" request. First, I need to really understand what the user is after. \"Short testing answer\"... that's deliberately vague, isn't it? So, I have to figure out the user's intent, the context, what they *really* need. They're asking for something concise about testing, fine, but what *aspect* of testing? That's the key.\n\nSo, I immediately start breaking down the core principles of testing in my mind. The fundamentals. What *is* testing, really? It's about finding those nasty bugs, those defects that could cripple things later on. But it's more than that; it's about *quality*. Ensuring the system, the application, the product – whatever it is – actually *works* as intended. It's about verifying that all the requirements are met, that the functionality is bulletproof, and that everything is validated.\n\nBeyond that, I know testing is about improving the reliability and performance. No one wants a slow or glitchy system. Prevention is key too: catching issues *before* they cause a major headache for users. It's about building confidence in the product, letting everyone involved know it’s ready to go. And, of course, testing always has a risk assessment element. What are the potential pitfalls? What could go wrong? Testing helps me understand and mitigate those risks. This framework, the core principles of testing, are what I need to draw from to formulate an answer.\n",
                    },
                    {
                        "type": "text",
                        "text": 'Okay, here are a few options for a "short testing answer," depending on what context you need it for:\n\n**Option 1 (General Definition):**\nTesting is the process of evaluating a system or component to determine if it meets specified requirements, identify defects, and ensure quality.\n\n**Option 2 (Purpose-focused):**\nTesting aims to find defects, verify functionality, and build confidence in a product\'s quality and reliability.\n\n**Option 3 (Very concise):**\nTesting ensures quality by finding defects and verifying functionality.\n\n**Option 4 (Benefit-focused):**\nTesting reduces risk and improves user experience by identifying and fixing issues before release.',
                    },
                ],
                refusal=None,
                role="assistant",
                annotations=None,
                audio=None,
                function_call=None,
                tool_calls=None,
            ),
        )
    ],
    created=<masked>,
    model="gemini-2.5-flash",
    object="chat.completion",
    service_tier=None,
    system_fingerprint=None,
    usage=CompletionUsage(
        completion_tokens=140,
        prompt_tokens=4,
        total_tokens=270,
        completion_tokens_details=None,
        prompt_tokens_details=None,
        reasoning_tokens=126,
    ),
)
```
- If thinking is disabled, the `content` field is a plain string, as expected. This difference is completely undocumented.
There is also no documentation on how or what to set in the `thinking` extra body for Gemini; I had to refer to the Claude example to figure out how to enable thinking at all, as shown in the sketch below.
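For reference, this is roughly the request that produces the response above. It is a minimal sketch, not exact code: the workspace host, token, and endpoint name are placeholders, and the `thinking` payload (including `budget_tokens`) is simply copied from the Claude example because nothing is documented for Gemini:

```python
from openai import OpenAI

# Placeholders: workspace host, token, and serving endpoint name are not the real values.
client = OpenAI(
    api_key="<DATABRICKS_TOKEN>",
    base_url="https://<workspace-host>/serving-endpoints",
)

response = client.chat.completions.create(
    model="<gemini-2-5-flash-endpoint>",  # the Gemini 2.5 Flash serving endpoint
    messages=[{"role": "user", "content": "Short testing answer"}],
    # Mirrors the Claude example; budget_tokens is a guess since it is undocumented for Gemini.
    extra_body={"thinking": {"type": "enabled", "budget_tokens": 1024}},
)

# With thinking enabled this prints <class 'list'>; with the extra_body removed, <class 'str'>.
print(type(response.choices[0].message.content))
```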
Moreover, you advertise an OpenAI-compatible endpoint, but the response is not compatible: the official openai Python package types `ChatCompletionMessage.content` as `Optional[str]`, while your `content` field varies depending on the model and its settings. Any package that validates types when processing a chat completion message breaks on this; even your hosted MLflow raises a warning about the unexpected response format. The point of an "OpenAI-compatible" endpoint is that it can be used pretty much everywhere, and this makes it usable almost nowhere. For Gemini models specifically, Google already provides working OpenAI-compatible endpoints in both Vertex AI and the Gemini API, yet you went out of your way to make this one incompatible. Why?
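For anyone else hitting this, the shim below is the stopgap I'm using for now. It is my own workaround, not anything official: it flattens the list-of-dicts content back into the `Optional[str]` shape that OpenAI-compatible tooling expects before the response goes anywhere else.

```python
from typing import Optional, Union


def normalize_content(content: Union[str, list, None]) -> Optional[str]:
    """Collapse the list-of-dicts content returned when thinking is enabled
    into a plain string; pass normal string/None content through unchanged."""
    if content is None or isinstance(content, str):
        return content
    # Thinking-enabled Gemini responses look like [{"type": "text", "text": "..."}, ...]
    return "\n".join(
        part.get("text", "") for part in content if isinstance(part, dict)
    )


# Applied to the response from the request sketch above:
text = normalize_content(response.choices[0].message.content)
```

Note that this concatenates the reasoning text with the final answer, because both parts come back with `type: "text"` and there is no way to tell them apart, which is yet another reason this format needs to be documented.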