Databricks Community

mnissen1337 · yesterday

I’m building a Databricks App that continuously queries a SQL Warehouse roughly every 30 seconds to retrieve updated data.

To avoid the overhead of repeatedly opening new connections, I’m currently caching the Databricks SQL connection using lru_cache.

from functools import lru_cache

from databricks import sql
from databricks.sql.client import Connection

from config import settings

cfg = Config()


@lru_cache(maxsize=1)
def get_connection() -> Connection:
    """Return a cached Databricks SQL connection using the configured warehouse."""
    return sql.connect(
        server_hostname=cfg.host,
        http_path=settings.sql_warehouse_http_path,
        credentials_provider=lambda: cfg.authenticate,
        use_cloud_fetch=False,
    )


def execute_query(query: str, params: dict | list | None = None) -> None:
    """Execute a SQL statement that returns no results (DDL / DML)."""
    conn = get_connection()
    with conn.cursor() as cursor:
        cursor.execute(query, parameters=params)

The issue

After running fine for a while (roughly ~10 hours, not exact), the app starts failing with this error:

2026-06-01 06:10:27,780 [INFO] databricks.sql.thrift_backend - Error during request to server:
{"method": "ExecuteStatement",
 "session-id": "...",
 "http-code": 200,
 "error-message": "",
 "original-exception": "ExecuteStatement command can only be retried for codes 429 and 503",
 "no-retry-reason": "non-retryable error",
 "attempt": "1/30"}

Once this happens, the app stops working reliably until I redeploy it. After redeploying, everything works again.

My assumption

Because the issue only appears after long runtime, I strongly suspect that the cached connection/session eventually becomes stale or invalid.

Given that I am reusing the same connection object for all queries, this seems like the most likely explanation.

Questions

Is it expected that a long-lived cached Databricks SQL connection eventually becomes invalid?
What is the recommended approach for this kind of long-running polling app?
- periodic reconnection (e.g. every 1 hour)?
- retry + reconnect on failure?
- or something like connection pooling provided by Databricks (if such exists, could not find it myself)?
Does Databricks have any built-in mechanism to handle stale sessions automatically for long-lived connections?

Thanks in advance! 🙂

balajij8 · 4 hours ago

You can expect Long lived cached SQL connections to become stale (due to idle and session timeouts) for better resource governance (warehouse auto scaling), security and optimizations (TLS drops, backend session expiration, routing). The underlying Thrift session is invalidated.

You can follow below

Connection lifecycle management - You can implement a reconnect on failure wrapper or use SQLAlchemy with the databricks version. Its QueuePool provides various parameters - pool_pre_ping (True - validates connections before use) and pool_recycle (1800 - forces refresh every 30 mins) solving staleness. More details here
Idempotent retries - You can catch session errors, discard the connection, instantiate a new one and implement retry with exponential backoff.
Warehouse - You can disable auto stop in the warehouse if you require continuous availability.

You can plan for client side recycling as recycling is user responsibility.

View solution in original post

balajij8 · 4 hours ago

You can expect Long lived cached SQL connections to become stale (due to idle and session timeouts) for better resource governance (warehouse auto scaling), security and optimizations (TLS drops, backend session expiration, routing). The underlying Thrift session is invalidated.

You can follow below

Connection lifecycle management - You can implement a reconnect on failure wrapper or use SQLAlchemy with the databricks version. Its QueuePool provides various parameters - pool_pre_ping (True - validates connections before use) and pool_recycle (1800 - forces refresh every 30 mins) solving staleness. More details here
Idempotent retries - You can catch session errors, discard the connection, instantiate a new one and implement retry with exponential backoff.
Warehouse - You can disable auto stop in the warehouse if you require continuous availability.

You can plan for client side recycling as recycling is user responsibility.

mnissen1337 · 3 hours ago

When looking into the documentation for SQLAlachemy with Databricks it seems like you need to specify a PAT when creating the engine. Is it correct that it does not allow for any other authentication methods? Seems like it is just wrapping on top off the databricks sql module so cant we use m2m authentication (similarly to what i have done in my post using the credentials_provider, passing it the relevant information from cfg.authenticate?)

balajij8 · an hour ago

SQLAlchemy dialect is a wrapper for the native databricks sql connector. You can try to pass the various authentication configuration supported by the underlying SQL connector directly into the connect_args dictionary parameter of the alchemy engine.

import os
from sqlalchemy import create_engine, text
from databricks.sql.auth import AuthType

# Workspace and app credentials
DATABRICKS_HOST = os.environ.get("DBX_HOST")
HTTP_PATH = os.environ.get("DBX_HTTP_PATH")
AZURE_CLIENT_ID = os.environ.get("AZURE_APP_ID")
AZURE_CLIENT_SECRET = os.environ.get("AZURE_APP_SECRET")

# Build the engine
engine = create_engine(
    f"databricks://token:not_used@{DATABRICKS_HOST}?http_path={HTTP_PATH}&catalog=main&schema=default",
    connect_args={
        "auth_type": AuthType.AZURE_SP_M2M.value,
        "azure_client_id": AZURE_CLIENT_ID,
        "azure_client_secret": AZURE_CLIENT_SECRET,
    }
)

Meanwhile, you can remove the cache lru_cache until you get the new approach fixed.