Re: How to serve a RAG chain endpoint that support...

Louis_Frolio · ‎11-07-2025

Greetings @MarsWalker, here’s how to get true streaming from a served RAG chain on Databricks Model Serving.

What’s going on

The served endpoint error (“This endpoint does not support streaming”) is expected when the deployed entity doesn’t expose a stream-capable interface to Model Serving. A LangChain chain logged with the MLflow LangChain flavor will invoke fine synchronously, but the Serving layer won’t stream unless the model implements the stream contract Databricks supports (predict_stream / ResponsesAgent stream events).
Streaming is natively supported for OpenAI-compatible endpoints (external models) via the stream parameter, but that mechanism applies to chat/completions tasks on external providers; it doesn’t automatically enable streaming for custom LangChain chains you log as MLflow models.
Today, the recommended way to serve streamable custom GenAI logic (including LangChain) is to wrap your chain with the MLflow ResponsesAgent interface and implement predict_stream. That makes your endpoint stream over the Databricks OpenAI-compatible client and REST, and it integrates with tracing/evaluation.

The working recipe

1) Wrap your chain in a ResponsesAgent with predict_stream.

# pip install -U mlflow databricks-agents databricks-langchain langchain

import uuid
from typing import Generator

import mlflow
from mlflow.pyfunc import ResponsesAgent
from mlflow.types.responses import (
    ResponsesAgentRequest, ResponsesAgentResponse, ResponsesAgentStreamEvent
)

class RAGChainAgent(ResponsesAgent):
    def __init__(self, chain):
        self.chain = chain

    def _chain_input(self, request: ResponsesAgentRequest) -> dict:
        # request.input holds the conversation items; "messages" is the key this
        # example chain expects, so rename it to match your own chain's input schema
        return {"messages": [m.model_dump() for m in request.input]}

    def predict(self, request: ResponsesAgentRequest) -> ResponsesAgentResponse:
        # Non-streaming path: run and return one complete response
        answer = self.chain.invoke(self._chain_input(request))
        item = self.create_text_output_item(text=str(answer), id=str(uuid.uuid4()))
        return ResponsesAgentResponse(output=[item])

    def predict_stream(
        self, request: ResponsesAgentRequest
    ) -> Generator[ResponsesAgentStreamEvent, None, None]:
        # Streaming path: emit deltas, then a final "done" item
        item_id = str(uuid.uuid4())
        full_text = ""
        # Uses the chain's synchronous stream(); adapt this if your chain only exposes astream()
        for chunk in self.chain.stream(self._chain_input(request)):
            text_chunk = chunk if isinstance(chunk, str) else str(chunk)
            full_text += text_chunk
            yield ResponsesAgentStreamEvent(
                **self.create_text_delta(delta=text_chunk, item_id=item_id)
            )

        # Final completion event with the aggregated text
        yield ResponsesAgentStreamEvent(
            type="response.output_item.done",
            item=self.create_text_output_item(text=full_text, id=item_id),
        )

# Log the agent (`chain` is your existing LangChain runnable)
with mlflow.start_run():
    mlflow.pyfunc.log_model(name="rag_agent", python_model=RAGChainAgent(chain))

2) Serve the logged agent as a custom model (UI or API). Agent serving is supported as a custom model in Model Serving.

3) Query with streaming enabled (Databricks OpenAI client is the easiest):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
client = w.serving_endpoints.get_open_ai_client()

# Responses API (predict / ResponsesAgent)
stream = client.responses.create(
    model="<your-endpoint-name>",
    input=[{"role": "user", "content": "Ask me something from the knowledge base"}],
    stream=True,  # key to trigger predict_stream
)

for chunk in stream:
    # Handle stream chunks (delta events), aggregate, etc.
    print(chunk)
```

Alternatively, use chat.completions with stream=True if your wrapper exposes an OpenAI-compatible chat schema, or call predict_stream via the SDK directly; all of these map to the same streaming behavior on served agents.
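If you want to see the delta-then-done flow end to end before deploying anything, here is a minimal stdlib-only sketch of both sides. FakeChain, stream_events and collect_text are hypothetical names invented for this illustration, and the plain dicts only approximate the real events; I'm assuming the Responses-style type strings "response.output_text.delta" and "response.output_item.done".

```python
import uuid
from typing import Iterable, Iterator


class FakeChain:
    """Hypothetical stand-in for a LangChain runnable that streams string chunks."""

    def stream(self, _inputs: dict) -> Iterator[str]:
        yield from ["Databricks ", "supports ", "streaming."]


def stream_events(chain, inputs: dict) -> Iterator[dict]:
    """Server side: one delta event per chunk, then a single done event."""
    item_id = str(uuid.uuid4())
    parts = []
    for chunk in chain.stream(inputs):
        parts.append(chunk)
        yield {"type": "response.output_text.delta", "item_id": item_id, "delta": chunk}
    # The final event repeats the full text so non-incremental consumers can use it
    yield {
        "type": "response.output_item.done",
        "item": {"id": item_id, "text": "".join(parts)},
    }


def collect_text(events: Iterable[dict]) -> str:
    """Client side: concatenate only the delta events, ignoring everything else."""
    return "".join(
        e["delta"] for e in events if e["type"] == "response.output_text.delta"
    )


events = list(stream_events(FakeChain(), {"messages": []}))
text = collect_text(events)
```

The point of the design is that the deltas and the final item share one item id, so a client can either render deltas as they arrive or wait for the done event and take the aggregated text.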

Notes and common pitfalls

The Query Endpoint panel in the Serving UI returns the full response for custom models; use the Databricks OpenAI client or REST/SDK to receive streamed chunks in real time.
External-model endpoints (OpenAI, Anthropic, Cohere, etc.) support “stream” for chat/completions requests out-of-the-box; that path is separate from custom MLflow/LangChain deployments and is governed by the provider’s streaming capability.
For LangChain specifically, Databricks documents the integration as experimental and recommends using MLflow/ResponsesAgent for production-grade agents, including streaming, tracing, and evaluation.
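If you go the raw REST route instead of the OpenAI client, you will need to parse the streamed body yourself. The sketch below assumes the endpoint streams server-sent events, meaning "data: <json>" lines with an optional "data: [DONE]" sentinel; that framing is my assumption, so check it against what your endpoint actually returns. iter_sse_json is a hypothetical helper name, and the sample lines are made up.

```python
import json
from typing import Iterable, Iterator


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each `data:` line, stopping at a [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)


# Made-up sample of what a streamed body might look like, line by line
sample = [
    'data: {"type": "response.output_text.delta", "delta": "Hi"}',
    "",
    'data: {"type": "response.output_text.delta", "delta": " there"}',
    "",
    "data: [DONE]",
]
chunks = list(iter_sse_json(sample))
```

With the requests library you would feed this from response.iter_lines(decode_unicode=True) on a request made with stream=True.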

Why your current chain.py endpoint isn’t streaming

The MLflow LangChain flavor can execute your chain synchronously, but the Model Serving scoring server only streams when the logged model exposes a streamable interface (predict_stream or the ResponsesAgent streaming events). Your current logged chain doesn’t advertise that interface, so the endpoint rejects stream=True with “This endpoint does not support streaming.”

If you prefer to keep LangChain “as-is”

Keep all your LangChain logic, but add a thin agent wrapper (as shown above) that calls chain.stream/chain.astream and yields ResponsesAgentStreamEvent deltas. This approach preserves your code and makes the endpoint streamable in Databricks.
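One thing to watch in that thin wrapper: depending on how your chain ends, chain.stream may yield plain strings, message chunks with a .content attribute, or dicts, and calling str() on the latter two streams a repr rather than the answer text. Here is a small duck-typed helper as a sketch. chunk_to_text is a hypothetical name, and the "answer" key is an assumption about a retrieval-style chain's output, so adjust it to your chain.

```python
from types import SimpleNamespace
from typing import Any


def chunk_to_text(chunk: Any) -> str:
    """Best-effort conversion of a streamed chain chunk to the text to emit."""
    if isinstance(chunk, str):
        return chunk
    # Message-like chunks: use .content when it is a plain string
    content = getattr(chunk, "content", None)
    if isinstance(content, str):
        return content
    # Dict chunks: assume the answer text lives under "answer" (hypothetical key)
    if isinstance(chunk, dict):
        answer = chunk.get("answer", "")
        return answer if isinstance(answer, str) else ""
    # Unknown chunk types contribute no text rather than a repr
    return ""


samples = [
    "plain ",
    SimpleNamespace(content="message "),
    {"context": ["doc1", "doc2"]},
    {"answer": "dict"},
]
joined = "".join(chunk_to_text(c) for c in samples)
```

In predict_stream you would call this in place of the str(chunk) fallback and skip emitting a delta when it returns an empty string.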

Hope this helps, Louis.