cancel
Showing results forĀ 
Search instead forĀ 
Did you mean:Ā 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forĀ 
Search instead forĀ 
Did you mean:Ā 

Databricks SQL connection becomes stale in long-running app

mnissen1337
New Contributor III

I’m building a Databricks App that continuously queries a SQL Warehouse roughly every 30 seconds to retrieve updated data.

To avoid the overhead of repeatedly opening new connections, I’m currently caching the Databricks SQL connection using lru_cache.

 
from functools import lru_cache

from databricks import sql
from databricks.sql.client import Connection

from config import settings

cfg = Config()


@lru_cache(maxsize=1)
def get_connection() -> Connection:
"""Return a cached Databricks SQL connection using the configured warehouse."""
return sql.connect(
server_hostname=cfg.host,
http_path=settings.sql_warehouse_http_path,
credentials_provider=lambda: cfg.authenticate,
use_cloud_fetch=False,
)


def execute_query(query: str, params: dict | list | None = None) -> None:
"""Execute a SQL statement that returns no results (DDL / DML)."""
conn = get_connection()
with conn.cursor() as cursor:
cursor.execute(query, parameters=params)

The issue

After running fine for a while (roughly ~10 hours, not exact), the app starts failing with this error:

 
2026-06-01 06:10:27,780 [INFO] databricks.sql.thrift_backend - Error during request to server:
{"method": "ExecuteStatement",
"session-id": "...",
"http-code": 200,
"error-message": "",
"original-exception": "ExecuteStatement command can only be retried for codes 429 and 503",
"no-retry-reason": "non-retryable error",
"attempt": "1/30"}

Once this happens, the app stops working reliably until I redeploy it. After redeploying, everything works again.


My assumption

Because the issue only appears after long runtime, I strongly suspect that the cached connection/session eventually becomes stale or invalid.

Given that I am reusing the same connection object for all queries, this seems like the most likely explanation.


Questions

  1. Is it expected that a long-lived cached Databricks SQL connection eventually becomes invalid? 
  2. What is the recommended approach for this kind of long-running polling app?
    • periodic reconnection (e.g. every 1 hour)?
    • retry + reconnect on failure?
    • or something like connection pooling provided by Databricks (if such exists, could not find it myself)?
  3. Does Databricks have any built-in mechanism to handle stale sessions automatically for long-lived connections?

Thanks in advance! šŸ™‚ 

1 ACCEPTED SOLUTION

Accepted Solutions

balajij8
Contributor III
You can expect Long lived cached SQL connections to become stale (due to idle and session timeouts) for better resource governance (warehouse auto scaling), security and optimizations (TLS drops, backend session expiration, routing). The underlying Thrift session is invalidated.
 
You can follow below
  • Connection lifecycle management - You can implement a reconnect on failure wrapper or use SQLAlchemy with the databricks version. Its QueuePool provides various parameters - pool_pre_ping (True - validates connections before use) and pool_recycle (1800 - forces refresh every 30 mins) solving staleness. More details here
  • Idempotent retries - You can catch session errors, discard the connection, instantiate a new one and implement retry with exponential backoff.
  • Warehouse - You can disable auto stop in the warehouse if you require continuous availability. 

You can plan for client side recycling as recycling is user responsibility.

View solution in original post

3 REPLIES 3

balajij8
Contributor III
You can expect Long lived cached SQL connections to become stale (due to idle and session timeouts) for better resource governance (warehouse auto scaling), security and optimizations (TLS drops, backend session expiration, routing). The underlying Thrift session is invalidated.
 
You can follow below
  • Connection lifecycle management - You can implement a reconnect on failure wrapper or use SQLAlchemy with the databricks version. Its QueuePool provides various parameters - pool_pre_ping (True - validates connections before use) and pool_recycle (1800 - forces refresh every 30 mins) solving staleness. More details here
  • Idempotent retries - You can catch session errors, discard the connection, instantiate a new one and implement retry with exponential backoff.
  • Warehouse - You can disable auto stop in the warehouse if you require continuous availability. 

You can plan for client side recycling as recycling is user responsibility.

mnissen1337
New Contributor III

When looking into the documentation for SQLAlachemy with Databricks it seems like you need to specify a PAT when creating the engine. Is it correct that it does not allow for any other authentication methods? Seems like it is just wrapping on top off the databricks sql module so cant we use m2m authentication (similarly to what i have done in my post using the credentials_provider, passing it the relevant information from cfg.authenticate?)

balajij8
Contributor III

SQLAlchemy dialect is a wrapper for the native databricks sql connector. You can try to pass the various authentication configuration supported by the underlying SQL connector directly into the connect_args dictionary parameter of the alchemy engine.

import os
from sqlalchemy import create_engine, text
from databricks.sql.auth import AuthType

# Workspace and app credentials
DATABRICKS_HOST = os.environ.get("DBX_HOST")
HTTP_PATH = os.environ.get("DBX_HTTP_PATH")
AZURE_CLIENT_ID = os.environ.get("AZURE_APP_ID")
AZURE_CLIENT_SECRET = os.environ.get("AZURE_APP_SECRET")

# Build the engine
engine = create_engine(
    f"databricks://token:not_used@{DATABRICKS_HOST}?http_path={HTTP_PATH}&catalog=main&schema=default",
    connect_args={
        "auth_type": AuthType.AZURE_SP_M2M.value,
        "azure_client_id": AZURE_CLIENT_ID,
        "azure_client_secret": AZURE_CLIENT_SECRET,
    }
)

Meanwhile, you can remove the cache lru_cache until you get the new approach fixed.