<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hi @abhijit007, Your debugging was thorough and you corre... in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/databricks-app-issue-socket-hang-up-econnreset-when-api-call/m-p/150361#M53396</link>
    <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/97456"&gt;@abhijit007&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;Your debugging was thorough and you correctly isolated the issue: the timeout is happening upstream of your application code. Databricks Apps run behind a managed ingress/request router that enforces request-level timeouts (typically around 30 seconds for idle connections). Because this is a platform-level gateway, no amount of configuration inside your Next.js or FastAPI code can extend it.&lt;/P&gt;
&lt;P&gt;The recommended approach is to switch from a synchronous request/response pattern to an asynchronous polling pattern (sometimes called "status pull"). Here is a concrete implementation you can adapt:&lt;/P&gt;
&lt;P&gt;FASTAPI BACKEND CHANGES&lt;/P&gt;
&lt;P&gt;Use FastAPI's BackgroundTasks to kick off the long-running agent call, store results in memory (or a more durable store), and return immediately with a task ID.&lt;/P&gt;
&lt;PRE&gt;import uuid
from fastapi import FastAPI, BackgroundTasks
from fastapi.responses import JSONResponse

app = FastAPI()

# In-memory task store (use Redis or a database for production)
tasks = {}

async def call_agent(task_id: str, payload: dict):
    try:
        tasks[task_id]["status"] = "processing"
        # Your existing agent call goes here (ws_openai_client is the
        # AsyncOpenAI client you already configured)
        response = await ws_openai_client.chat.completions.create(
            model="your-agent-endpoint",
            messages=payload["messages"],
        )
        tasks[task_id]["status"] = "complete"
        tasks[task_id]["result"] = response.choices[0].message.content
    except Exception as e:
        tasks[task_id]["status"] = "failed"
        tasks[task_id]["error"] = str(e)

@app.post("/api/demo/alerts")
async def start_alert(payload: dict, background_tasks: BackgroundTasks):
    task_id = str(uuid.uuid4())
    tasks[task_id] = {"status": "pending", "result": None, "error": None}
    background_tasks.add_task(call_agent, task_id, payload)
    return JSONResponse(status_code=202, content={"task_id": task_id})

@app.get("/api/demo/alerts/status/{task_id}")
async def get_alert_status(task_id: str):
    task = tasks.get(task_id)
    if not task:
        return JSONResponse(status_code=404, content={"error": "Task not found"})
    return task&lt;/PRE&gt;
&lt;P&gt;NEXT.JS FRONTEND CHANGES&lt;/P&gt;
&lt;P&gt;Poll the status endpoint every few seconds until the result is ready.&lt;/P&gt;
&lt;PRE&gt;async function fetchAlertData(payload) {
  // 1. Start the task
  const startRes = await fetch("/api/demo/alerts", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  if (!startRes.ok) {
    throw new Error(`Failed to start task: ${startRes.status}`);
  }
  const { task_id } = await startRes.json();

  // 2. Poll for completion (cap attempts so a lost task cannot poll forever)
  const maxAttempts = 100; // 100 polls x 3 s = 5 minutes
  for (let attempt = 0; attempt &amp;lt; maxAttempts; attempt++) {
    await new Promise((r) =&amp;gt; setTimeout(r, 3000)); // poll every 3 seconds
    const statusRes = await fetch(`/api/demo/alerts/status/${task_id}`);
    const statusData = await statusRes.json();

    if (statusData.status === "complete") {
      return statusData.result;
    }
    if (statusData.status === "failed") {
      throw new Error(statusData.error || "Agent call failed");
    }
    // Otherwise still "pending" or "processing", keep polling
  }
  throw new Error("Timed out waiting for agent response");
}&lt;/PRE&gt;
&lt;P&gt;ALTERNATIVE: SERVER-SENT EVENTS (SSE)&lt;/P&gt;
&lt;P&gt;If the Databricks Apps ingress supports streaming responses (where bytes are being sent continuously), you can use SSE instead of polling. SSE keeps the connection alive by sending incremental data, which can prevent the idle-connection timeout from firing:&lt;/P&gt;
&lt;PRE&gt;from fastapi.responses import StreamingResponse
import json, asyncio

# Note: the browser EventSource API only supports GET, so consume this
# POST endpoint with fetch() and a streaming body reader instead.
@app.post("/api/demo/alerts/stream")
async def stream_alert(payload: dict):
    async def event_generator():
        task_id = str(uuid.uuid4())
        # Register the task entry BEFORE starting the coroutine so
        # call_agent never sees a missing key
        tasks[task_id] = {"status": "pending", "result": None, "error": None}
        agent_task = asyncio.create_task(call_agent(task_id, payload))

        while not agent_task.done():
            yield f"data: {json.dumps({'status': 'processing'})}\n\n"
            await asyncio.sleep(2)

        task = tasks[task_id]
        yield f"data: {json.dumps(task)}\n\n"

    return StreamingResponse(event_generator(), media_type="text/event-stream")&lt;/PRE&gt;
&lt;P&gt;However, note that whether SSE bypasses the timeout depends on the specific ingress behavior. The polling approach is the safest and most widely supported pattern.&lt;/P&gt;
&lt;P&gt;KEY POINTS&lt;/P&gt;
&lt;P&gt;1. The ~30-second cutoff is enforced by the Databricks Apps managed ingress router, not by your application frameworks.&lt;/P&gt;
&lt;P&gt;2. The polling pattern is the standard recommendation for any Databricks App workload that exceeds the gateway timeout.&lt;/P&gt;
&lt;P&gt;3. For production use, consider replacing the in-memory tasks dictionary with something persistent (Redis, a Delta table, or even a simple SQLite database) so that results survive app restarts.&lt;/P&gt;
&lt;P&gt;4. You can also look at Databricks Apps documentation for additional guidance on configuring compute size and app architecture:&lt;BR /&gt;
&lt;A href="https://docs.databricks.com/en/dev-tools/databricks-apps/index.html" target="_blank"&gt;https://docs.databricks.com/en/dev-tools/databricks-apps/index.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;* This reply was drafted with an agent system I built, which researches responses using the documentation I have available and previous memory. I personally review each draft for obvious issues, monitor the system's reliability, and update it when I detect drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.&lt;/P&gt;
&lt;P&gt;If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.&lt;/P&gt;</description>
    <pubDate>Mon, 09 Mar 2026 06:00:20 GMT</pubDate>
    <dc:creator>SteveOstrowski</dc:creator>
    <dc:date>2026-03-09T06:00:20Z</dc:date>
    <item>
      <title>Databricks App Issue– “socket hang up / ECONNRESET” when API call runs &gt; 30 seconds</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-app-issue-socket-hang-up-econnreset-when-api-call/m-p/149966#M53213</link>
      <description>&lt;P&gt;&lt;STRONG&gt;&lt;FONT color="#008080"&gt;Problem Statement:&lt;/FONT&gt;&lt;BR /&gt;&lt;/STRONG&gt;We are running a Data App on Databricks that uses Next.js (frontend) and FastAPI (backend). The backend calls a Databricks Agent (AgentBricks) via a serving endpoint, which typically needs ~1 minute to return a response. However, any request that takes &amp;gt; ~30 seconds results in:&lt;BR /&gt;&lt;STRONG&gt;Error:&lt;/STRONG&gt; &lt;FONT face="arial black,avant garde" color="#800000"&gt;socket hang up { code: 'ECONNRESET' }&lt;BR /&gt;&lt;/FONT&gt;This happens consistently and before the Agent finishes processing.&lt;/P&gt;&lt;P&gt;&lt;FONT color="#008080"&gt;&lt;STRONG&gt;What We Are Trying To Do ::&lt;BR /&gt;&lt;/STRONG&gt;&lt;/FONT&gt;We’re simply trying to display the agent’s response in the frontend. The frontend makes a request:&lt;BR /&gt;Next.js → FastAPI → Databricks AgentBricks endpoint → FastAPI → Next.js&lt;/P&gt;&lt;P&gt;The expectation is that long-running agent responses (45–60 sec) should return normally through the API chain.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT color="#008080"&gt;&lt;STRONG&gt;What We Tried ::&lt;/STRONG&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;1. Route timeout configuration in Next.js&lt;BR /&gt;&lt;/STRONG&gt;Set long server execution time:&lt;BR /&gt;export const maxDuration = 300;&lt;BR /&gt;export const runtime = "nodejs";&lt;BR /&gt;&lt;BR /&gt;Configures OK, but Next.js still disconnects at ~30 seconds.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;2. FastAPI timeout controls&lt;BR /&gt;&lt;/STRONG&gt;Set timeout in OpenAI/DBR client to 300s:&lt;/P&gt;&lt;P&gt;ws_openai_client = AsyncOpenAI(&lt;BR /&gt;api_key=DATABRICKS_PAT_TOKEN,&lt;BR /&gt;base_url=BASE_URL + "/serving-endpoints",&lt;BR /&gt;timeout=300.0&lt;BR /&gt;)&lt;/P&gt;&lt;P&gt;FastAPI is NOT timing out, meaning the reset occurs before the Python layer gets a chance to respond.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;3. 
Artificial delay test&lt;BR /&gt;&lt;/STRONG&gt;To isolate the issue, we added a 60-second sleep:&lt;BR /&gt;&lt;BR /&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/87269"&gt;@app&lt;/a&gt;.post("/api/demo/alerts")&lt;BR /&gt;def get_alert_data(payload):&lt;BR /&gt;time.sleep(60)&lt;BR /&gt;return {...}&lt;/P&gt;&lt;P&gt;Result: Next.js → FastAPI request still dies at ~30 seconds Confirms the problem is NOT related to the agent or Databricks model serving.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;4. Simplified Agent Prompt:&lt;BR /&gt;&lt;/STRONG&gt;A lightweight prompt that finishes &amp;lt; 30 seconds works fine.&lt;BR /&gt;Confirms timeout threshold is the limiting factor.&lt;/P&gt;&lt;P&gt;&lt;FONT color="#008080"&gt;&lt;STRONG&gt;&lt;BR /&gt;Observation:&lt;BR /&gt;&lt;/STRONG&gt;&lt;/FONT&gt;The request is being terminated before FastAPI finishes processing and before the Databricks agent responds. FastAPI continues running normally in the background, which means the timeout is happening upstream of the Python backend.&lt;/P&gt;</description>
      <pubDate>Fri, 06 Mar 2026 08:55:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-app-issue-socket-hang-up-econnreset-when-api-call/m-p/149966#M53213</guid>
      <dc:creator>abhijit007</dc:creator>
      <dc:date>2026-03-06T08:55:25Z</dc:date>
    </item>
    <item>
      <title>Re: Databricks App Issue– “socket hang up / ECONNRESET” when API call runs &gt; 30 seconds</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-app-issue-socket-hang-up-econnreset-when-api-call/m-p/150013#M53223</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Summary:&lt;/STRONG&gt; The &lt;CODE&gt;ECONNRESET&lt;/CODE&gt; error at ~30 seconds is caused by the Databricks Apps managed ingress request router, which strictly terminates long-running synchronous HTTP requests to protect platform stability. Local framework configurations (like Next.js &lt;CODE&gt;maxDuration&lt;/CODE&gt; or FastAPI timeouts) apply only to the container and cannot override these platform-level gateway limits. To fix this, Databricks best practices dictate implementing an asynchronous "status pull" (polling) pattern.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Why Your App is Disconnecting&lt;/STRONG&gt; Databricks Apps operate within a managed Serverless Compute Plane and sit behind a Databricks-controlled Request Router. This routing layer actively monitors connection health and enforces strict timeouts on synchronous requests (often dropping them if no data is passed within ~30-120 seconds). When your Next.js frontend waits synchronously for FastAPI (which is in turn waiting ~60 seconds for the AgentBricks endpoint), the Databricks ingress proxy assumes the connection has hung and forces a disconnect (&lt;CODE&gt;ECONNRESET&lt;/CODE&gt;). Because the timeout happens upstream at the proxy layer, your FastAPI process remains unaware and finishes the task normally in the background.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;The Recommended Solution: "Status Pull" Pattern&lt;/STRONG&gt; To accommodate workloads like long-running AI agents that exceed gateway timeout thresholds, you must re-architect the interaction between Next.js and FastAPI to use an asynchronous polling model:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Trigger and Return:&lt;/STRONG&gt; Update your initial FastAPI endpoint (&lt;CODE&gt;/api/demo/alerts&lt;/CODE&gt;) so that it kicks off the AgentBricks call as a background task. It should immediately respond to Next.js with an HTTP 202 (Accepted) status and a unique &lt;CODE&gt;task_id&lt;/CODE&gt;.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Implement a Status Endpoint:&lt;/STRONG&gt; Create a secondary FastAPI endpoint (e.g., &lt;CODE&gt;/api/demo/alerts/status/{task_id}&lt;/CODE&gt;) that checks the state of the background task (e.g., pending, processing, or complete).&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Poll from Next.js:&lt;/STRONG&gt; Configure your Next.js frontend to periodically ping the status endpoint (for example, every 3–5 seconds) until the Agent finishes processing and the final response payload is ready to be fetched and displayed.&lt;/LI&gt;
&lt;/OL&gt;
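&lt;P&gt;The three steps above can be sketched as a small, framework-agnostic polling helper. This is a hedged illustration (the function and parameter names are made up for this example); get_status would be any callable that fetches the status JSON, e.g. a GET to /api/demo/alerts/status/{task_id}:&lt;/P&gt;

```python
import time

def poll_until_done(get_status, interval=3.0, max_attempts=100):
    """Call get_status() every `interval` seconds until the task
    completes, fails, or the attempt budget runs out."""
    for _ in range(max_attempts):
        status = get_status()  # e.g. GET /api/demo/alerts/status/{task_id}
        if status["status"] == "complete":
            return status["result"]
        if status["status"] == "failed":
            raise RuntimeError(status.get("error", "Agent call failed"))
        # Still "pending" or "processing": wait and poll again
        time.sleep(interval)
    raise TimeoutError("Task did not finish within the polling budget")
```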
&lt;P&gt;This pattern circumvents the platform's ingress timeout limits, frees up UI threads, and is the standard runtime performance recommendation for heavy or long-running tasks on Databricks Apps.&lt;/P&gt;</description>
      <pubDate>Fri, 06 Mar 2026 15:53:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-app-issue-socket-hang-up-econnreset-when-api-call/m-p/150013#M53223</guid>
      <dc:creator>Lu_Wang_ENB_DBX</dc:creator>
      <dc:date>2026-03-06T15:53:12Z</dc:date>
    </item>
    <item>
      <title>Hi @abhijit007, Your debugging was thorough and you corre...</title>
      <link>https://community.databricks.com/t5/data-engineering/databricks-app-issue-socket-hang-up-econnreset-when-api-call/m-p/150361#M53396</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/97456"&gt;@abhijit007&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;Your debugging was thorough and you correctly isolated the issue: the timeout is happening upstream of your application code. Databricks Apps run behind a managed ingress/request router that enforces request-level timeouts (typically around 30 seconds for idle connections). Because this is a platform-level gateway, no amount of configuration inside your Next.js or FastAPI code can extend it.&lt;/P&gt;
&lt;P&gt;The recommended approach is to switch from a synchronous request/response pattern to an asynchronous polling pattern (sometimes called "status pull"). Here is a concrete implementation you can adapt:&lt;/P&gt;
&lt;P&gt;FASTAPI BACKEND CHANGES&lt;/P&gt;
&lt;P&gt;Use FastAPI's BackgroundTasks to kick off the long-running agent call, store results in memory (or a more durable store), and return immediately with a task ID.&lt;/P&gt;
&lt;PRE&gt;import uuid
from fastapi import FastAPI, BackgroundTasks
from fastapi.responses import JSONResponse

app = FastAPI()

# In-memory task store (use Redis or a database for production)
tasks = {}

async def call_agent(task_id: str, payload: dict):
    try:
        tasks[task_id]["status"] = "processing"
        # Your existing agent call goes here (ws_openai_client is the
        # AsyncOpenAI client you already configured)
        response = await ws_openai_client.chat.completions.create(
            model="your-agent-endpoint",
            messages=payload["messages"],
        )
        tasks[task_id]["status"] = "complete"
        tasks[task_id]["result"] = response.choices[0].message.content
    except Exception as e:
        tasks[task_id]["status"] = "failed"
        tasks[task_id]["error"] = str(e)

@app.post("/api/demo/alerts")
async def start_alert(payload: dict, background_tasks: BackgroundTasks):
    task_id = str(uuid.uuid4())
    tasks[task_id] = {"status": "pending", "result": None, "error": None}
    background_tasks.add_task(call_agent, task_id, payload)
    return JSONResponse(status_code=202, content={"task_id": task_id})

@app.get("/api/demo/alerts/status/{task_id}")
async def get_alert_status(task_id: str):
    task = tasks.get(task_id)
    if not task:
        return JSONResponse(status_code=404, content={"error": "Task not found"})
    return task&lt;/PRE&gt;
&lt;P&gt;NEXT.JS FRONTEND CHANGES&lt;/P&gt;
&lt;P&gt;Poll the status endpoint every few seconds until the result is ready.&lt;/P&gt;
&lt;PRE&gt;async function fetchAlertData(payload) {
  // 1. Start the task
  const startRes = await fetch("/api/demo/alerts", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  if (!startRes.ok) {
    throw new Error(`Failed to start task: ${startRes.status}`);
  }
  const { task_id } = await startRes.json();

  // 2. Poll for completion (cap attempts so a lost task cannot poll forever)
  const maxAttempts = 100; // 100 polls x 3 s = 5 minutes
  for (let attempt = 0; attempt &amp;lt; maxAttempts; attempt++) {
    await new Promise((r) =&amp;gt; setTimeout(r, 3000)); // poll every 3 seconds
    const statusRes = await fetch(`/api/demo/alerts/status/${task_id}`);
    const statusData = await statusRes.json();

    if (statusData.status === "complete") {
      return statusData.result;
    }
    if (statusData.status === "failed") {
      throw new Error(statusData.error || "Agent call failed");
    }
    // Otherwise still "pending" or "processing", keep polling
  }
  throw new Error("Timed out waiting for agent response");
}&lt;/PRE&gt;
&lt;P&gt;ALTERNATIVE: SERVER-SENT EVENTS (SSE)&lt;/P&gt;
&lt;P&gt;If the Databricks Apps ingress supports streaming responses (where bytes are being sent continuously), you can use SSE instead of polling. SSE keeps the connection alive by sending incremental data, which can prevent the idle-connection timeout from firing:&lt;/P&gt;
&lt;PRE&gt;from fastapi.responses import StreamingResponse
import json, asyncio

# Note: the browser EventSource API only supports GET, so consume this
# POST endpoint with fetch() and a streaming body reader instead.
@app.post("/api/demo/alerts/stream")
async def stream_alert(payload: dict):
    async def event_generator():
        task_id = str(uuid.uuid4())
        # Register the task entry BEFORE starting the coroutine so
        # call_agent never sees a missing key
        tasks[task_id] = {"status": "pending", "result": None, "error": None}
        agent_task = asyncio.create_task(call_agent(task_id, payload))

        while not agent_task.done():
            yield f"data: {json.dumps({'status': 'processing'})}\n\n"
            await asyncio.sleep(2)

        task = tasks[task_id]
        yield f"data: {json.dumps(task)}\n\n"

    return StreamingResponse(event_generator(), media_type="text/event-stream")&lt;/PRE&gt;
&lt;P&gt;However, note that whether SSE bypasses the timeout depends on the specific ingress behavior. The polling approach is the safest and most widely supported pattern.&lt;/P&gt;
&lt;P&gt;KEY POINTS&lt;/P&gt;
&lt;P&gt;1. The ~30-second cutoff is enforced by the Databricks Apps managed ingress router, not by your application frameworks.&lt;/P&gt;
&lt;P&gt;2. The polling pattern is the standard recommendation for any Databricks App workload that exceeds the gateway timeout.&lt;/P&gt;
&lt;P&gt;3. For production use, consider replacing the in-memory tasks dictionary with something persistent (Redis, a Delta table, or even a simple SQLite database) so that results survive app restarts.&lt;/P&gt;
&lt;P&gt;4. You can also look at Databricks Apps documentation for additional guidance on configuring compute size and app architecture:&lt;BR /&gt;
&lt;A href="https://docs.databricks.com/en/dev-tools/databricks-apps/index.html" target="_blank"&gt;https://docs.databricks.com/en/dev-tools/databricks-apps/index.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;* This reply was drafted with an agent system I built, which researches responses using the documentation I have available and previous memory. I personally review each draft for obvious issues, monitor the system's reliability, and update it when I detect drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.&lt;/P&gt;
&lt;P&gt;If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.&lt;/P&gt;</description>
      <pubDate>Mon, 09 Mar 2026 06:00:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/databricks-app-issue-socket-hang-up-econnreset-when-api-call/m-p/150361#M53396</guid>
      <dc:creator>SteveOstrowski</dc:creator>
      <dc:date>2026-03-09T06:00:20Z</dc:date>
    </item>
  </channel>
</rss>

