cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Databricks Connect - Will I ever have to Stop clean up Spark session when creating new per request

JTBS
New Contributor II

I have API that triggers Spark calculations - with API hosted by Python 3.12 pod in AKS and connects to Databricks cluster using Databricks 18.1.1.

Initially I was using getOrCreate call on my API requests and all works.

But problem is - as Spark session is shared.. after a while when new API request comes in, it fails with "INVALID SESSION" - this is because Cluster expired the Spark session after waiting enough due to inactivity - looks like.

I also felt like sharing same Spark session is not my intention as I was to isolate each API call/request and create NEW Spark session per request.

So now I am using create NEW Spark session per API request and dont have any issues.

But do I need to do any clean up?

If I am creating NEW session , I feel tempted to call Stop session or do any clean up once I am done.

But documentation seems to suggest never explicitly call STOP Session.

I am not seeing any issues so far by not calling STOP but not sure if this causes any resource leaks and want to do what is right?

What is right way to clean up Spark Session in this  case when its created explicitly per API request? Do nothing like what I am doing OR call STOP - please suggest

 

2 ACCEPTED SOLUTIONS

Accepted Solutions

balajij8
Contributor III

Hi, 

You can use Databricks SQL Connector for the activities instead of Databricks Connect if you require simple setup & easier management (session etc) with features below

  • SQL queries (SELECT, INSERT, UPDATE)
  • Executing parameterized SQL statements
  • Fetching data and doing transformations in Python after retrieval

More details here

Apache Spark generally requires you to explicitly declare that they are complete by using commands such as sys.exit() or sc.stop(). Databricks automatically terminates and cleans up jobs as they reach completion, so these commands are not necessary and should be removed. The automatic cleanup occurs when the request completes.

View solution in original post

emma_s
Databricks Employee
Databricks Employee

Hi there,

I

 

Short answer

You should call spark.stop() when you're done with each session. What you're doing now (not calling it) works, but it's not ideal — you're relying on the server-side idle timeout to clean up after you, and in the meantime each orphaned session consumes memory on the cluster for its SQLConf and SessionState. On a busy API with many requests, that can accumulate until the cluster eventually reclaims them.

Why the docs say "don't call stop"

The documentation warning about not calling stop() is aimed at a different scenario — specifically, when you're running inside a Databricks notebook or workspace environment where the session lifecycle is managed for you. In that context, calling stop() can tear down shared infrastructure you didn't create. It doesn't apply to your situation, where you're an external client creating sessions explicitly via Databricks Connect from an AKS pod. (Databricks Connect in notebooks)

What about automatic cleanup?

You may have read that Databricks Connect handles session cleanup automatically — and that's partially true. There are two mechanisms:
  • Process exit / shutdown hooks: PySpark registers an atexit handler that calls stop() on active sessions when the Python process terminates. If you were running a short-lived script (start, do work, exit), this would clean things up for you automatically. However, your API is a long-lived server process — it doesn't exit between requests. The shutdown hook only fires when the pod itself restarts or scales down, not after each request completes.
  • Server-side idle timeout: The Spark Connect server passively cleans up idle sessions after a period of inactivity. The release notes confirm: "Databricks Connect now automatically closes expired sessions on the client side." So sessions do eventually get reclaimed — but in the meantime they're sitting there consuming driver memory. (Databricks Connect release notes)
For a long-running API server creating a new session per request, neither mechanism gives you prompt cleanup. You'd accumulate sessions until the timeout kicks in.

What spark.stop() actually does in Databricks Connect

Since version 14.2.0, calling stop() on a Databricks Connect session sends a ReleaseSession RPC to the server, which:
  • Interrupts any running operations tied to that session
  • Releases server-side resources (memory, cached state)
  • Closes the gRPC channel on the client side
Since version 15.1.0, stop() is also idempotent — calling it on an already-closed or expired session won't throw an error. So it's safe to call in a finally block without worrying about race conditions with the idle timeout. (Databricks Connect release notes)

Recommended pattern

 

 
from databricks.connect import DatabricksSession

def handle_request():
    spark = DatabricksSession.builder.create()  # new session per request
    try:
        # your Spark work here
        result = spark.sql("SELECT ...")
        return result.collect()
    finally:
        spark.stop()  # clean up immediately

 
A few notes:
  • Use .create() rather than .getOrCreate() — you've already figured this out. The create() API was introduced in 16.0 specifically for this use case (always creates a fresh session rather than returning an existing one). (Databricks Connect release notes)
  • Wrap stop() in a finally block so it runs even if your Spark work throws an exception.
  • If you're on 18.1.1 as you mentioned, you have all the idempotent-stop and transient-retry improvements, so this is straightforward.

What happens if you don't call stop

It's not catastrophic — the server will eventually clean up idle sessions via the timeout. But you'll accumulate orphaned sessions in the interim, each holding memory on the driver. Under sustained load this can contribute to driver memory pressure.
That said, if your request volume is modest and you're not seeing issues, the idle timeout is probably handling things adequately. It's more of a "doing it properly" thing than a "this will definitely break" thing.

Relevant docs

View solution in original post

2 REPLIES 2

balajij8
Contributor III

Hi, 

You can use Databricks SQL Connector for the activities instead of Databricks Connect if you require simple setup & easier management (session etc) with features below

  • SQL queries (SELECT, INSERT, UPDATE)
  • Executing parameterized SQL statements
  • Fetching data and doing transformations in Python after retrieval

More details here

Apache Spark generally requires you to explicitly declare that they are complete by using commands such as sys.exit() or sc.stop(). Databricks automatically terminates and cleans up jobs as they reach completion, so these commands are not necessary and should be removed. The automatic cleanup occurs when the request completes.

emma_s
Databricks Employee
Databricks Employee

Hi there,

I

 

Short answer

You should call spark.stop() when you're done with each session. What you're doing now (not calling it) works, but it's not ideal — you're relying on the server-side idle timeout to clean up after you, and in the meantime each orphaned session consumes memory on the cluster for its SQLConf and SessionState. On a busy API with many requests, that can accumulate until the cluster eventually reclaims them.

Why the docs say "don't call stop"

The documentation warning about not calling stop() is aimed at a different scenario — specifically, when you're running inside a Databricks notebook or workspace environment where the session lifecycle is managed for you. In that context, calling stop() can tear down shared infrastructure you didn't create. It doesn't apply to your situation, where you're an external client creating sessions explicitly via Databricks Connect from an AKS pod. (Databricks Connect in notebooks)

What about automatic cleanup?

You may have read that Databricks Connect handles session cleanup automatically — and that's partially true. There are two mechanisms:
  • Process exit / shutdown hooks: PySpark registers an atexit handler that calls stop() on active sessions when the Python process terminates. If you were running a short-lived script (start, do work, exit), this would clean things up for you automatically. However, your API is a long-lived server process — it doesn't exit between requests. The shutdown hook only fires when the pod itself restarts or scales down, not after each request completes.
  • Server-side idle timeout: The Spark Connect server passively cleans up idle sessions after a period of inactivity. The release notes confirm: "Databricks Connect now automatically closes expired sessions on the client side." So sessions do eventually get reclaimed — but in the meantime they're sitting there consuming driver memory. (Databricks Connect release notes)
For a long-running API server creating a new session per request, neither mechanism gives you prompt cleanup. You'd accumulate sessions until the timeout kicks in.

What spark.stop() actually does in Databricks Connect

Since version 14.2.0, calling stop() on a Databricks Connect session sends a ReleaseSession RPC to the server, which:
  • Interrupts any running operations tied to that session
  • Releases server-side resources (memory, cached state)
  • Closes the gRPC channel on the client side
Since version 15.1.0, stop() is also idempotent — calling it on an already-closed or expired session won't throw an error. So it's safe to call in a finally block without worrying about race conditions with the idle timeout. (Databricks Connect release notes)

Recommended pattern

 

 
from databricks.connect import DatabricksSession

def handle_request():
    spark = DatabricksSession.builder.create()  # new session per request
    try:
        # your Spark work here
        result = spark.sql("SELECT ...")
        return result.collect()
    finally:
        spark.stop()  # clean up immediately

 
A few notes:
  • Use .create() rather than .getOrCreate() — you've already figured this out. The create() API was introduced in 16.0 specifically for this use case (always creates a fresh session rather than returning an existing one). (Databricks Connect release notes)
  • Wrap stop() in a finally block so it runs even if your Spark work throws an exception.
  • If you're on 18.1.1 as you mentioned, you have all the idempotent-stop and transient-retry improvements, so this is straightforward.

What happens if you don't call stop

It's not catastrophic — the server will eventually clean up idle sessions via the timeout. But you'll accumulate orphaned sessions in the interim, each holding memory on the driver. Under sustained load this can contribute to driver memory pressure.
That said, if your request volume is modest and you're not seeing issues, the idle timeout is probably handling things adequately. It's more of a "doing it properly" thing than a "this will definitely break" thing.

Relevant docs