Re: using concurrent.futures for parallelization

mark_ott · 10-31-2025

The "SparkSession$ does not exist in the JVM" error in your scenario is almost always caused by combining multiprocessing (for example ProcessPoolExecutor) with Spark. Spark contexts and sessions cannot safely be shared across processes, in Databricks or any other PySpark environment: the Python-side SparkSession is a handle to a JVM attached to the driver process, and that handle is neither fork-safe nor serializable, so it cannot be sent to child processes reliably.

Why This Happens

SparkSession is JVM-bound: Each process launched by ProcessPoolExecutor runs its own Python interpreter and tries to use the SparkSession that was created in the parent. This doesn't work; the session and its connection to the driver JVM cannot be forked or copied into child processes.
PySpark/Databricks does not support running Spark jobs from Python's multiprocessing or ProcessPoolExecutor. Each process would have to create its own SparkSession, which can cause resource contention, failures, and the SparkSession$ error you're seeing.
Sequential operation works because only one Python process and the main Spark JVM are involved; with no child processes, nothing has to be shared across a process boundary.

How to Fix

Use thread-based, not process-based parallelism:

Replace ProcessPoolExecutor with ThreadPoolExecutor. A SparkSession cannot survive a fork, but it can be used from concurrent threads within the same driver process, provided your own code is thread-safe.
Alternatively, use Spark's own parallelism through DataFrame partitioning or mapPartitions.

Example Fix

python

from concurrent.futures import ThreadPoolExecutor, as_completed
...
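if prm["NUM_PARALLEL"] > 1:
    # Threads run inside the driver process, so they all share one SparkSession.
    with ThreadPoolExecutor(max_workers=prm["NUM_PARALLEL"]) as executor:
        job_list = [executor.submit(process_row, row) for row in rows_lst]
        for job in as_completed(job_list):
            # result() re-raises any exception from process_row;
            # with a bare "pass" a failed row would go unnoticed.
            job.result()
else:
    for row in rows_lst:
        process_row(row)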

This should eliminate the "SparkSession$" error because all threads share the same process, JVM, and Spark context.

Important Caveats:

Make sure your homegrown library and all Spark interactions are thread-safe.
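Limit the level of parallelism (NUM_PARALLEL) so that you do not overwhelm the Spark driver.
Keep the GIL in mind: threads help when process_row mostly waits on Spark or I/O, but CPU-bound pure-Python work inside it will not run in parallel.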
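
To make the caveats above concrete, here is a minimal sketch of a bounded thread pool that records which rows failed instead of stopping at the first error. It uses only the standard library. run_rows_in_threads and demo_process_row are hypothetical names introduced for illustration; your real process_row (and the Spark calls inside it) would take the place of the stand-in.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import threading


def run_rows_in_threads(rows, process_row, max_workers=4):
    """Run process_row on every row in a bounded thread pool.

    Returns (succeeded, failures); failures is a list of (row, exception).
    """
    succeeded, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its row so a failure can be attributed.
        future_to_row = {executor.submit(process_row, row): row for row in rows}
        for future in as_completed(future_to_row):
            row = future_to_row[future]
            try:
                future.result()  # re-raises whatever process_row raised
            except Exception as exc:
                failures.append((row, exc))
            else:
                succeeded.append(row)
    return succeeded, failures


# Hypothetical stand-in for the real process_row.
# State shared between worker threads is guarded by a lock.
progress = {"done": 0}
progress_lock = threading.Lock()


def demo_process_row(row):
    if row < 0:
        raise ValueError(f"cannot process row {row}")
    with progress_lock:
        progress["done"] += 1


succeeded, failures = run_rows_in_threads([1, 2, -3, 4], demo_process_row, max_workers=2)
print("succeeded:", sorted(succeeded))
print("failed:", [(row, str(exc)) for row, exc in failures])
```

The result lists are only appended to from the main thread, inside the as_completed loop, so they need no lock; only the counter that the workers update does.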

Alternatives

Leverage Spark native partitioning: Run a single distributed Spark job that pulls data in partitions.
Use Databricks workflows: For large-scale orchestrations, Databricks Jobs can run tasks in parallel safely.
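
As a rough sketch of the native-partitioning route, the per-row work can be moved into a function that Spark runs once per partition on the executors. This assumes two things that may not hold for your setup: that the per-row work does not itself need the SparkSession (it only exists on the driver), and that spark.sparkContext is available on your cluster type. process_partition and run_distributed are hypothetical names, and the body of the loop is a placeholder.

```python
def process_partition(rows):
    """Handle one partition's rows on an executor; rows is an iterator.

    Must not touch the SparkSession, which only exists on the driver.
    """
    for row in rows:
        # Placeholder for the real per-row work (hypothetical).
        yield (row, len(str(row)))


def run_distributed(spark, rows_lst, num_partitions):
    """Spread rows_lst over num_partitions Spark tasks and collect the results."""
    rdd = spark.sparkContext.parallelize(rows_lst, numSlices=num_partitions)
    return rdd.mapPartitions(process_partition).collect()


# The partition function is plain Python, so it can be checked without a cluster:
print(list(process_partition(iter(["a", "bcd"]))))
```

On the driver you would call run_distributed(spark, rows_lst, prm["NUM_PARALLEL"]); Spark then schedules the partitions, so no executor pool is needed in your own code.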

Switching to ThreadPoolExecutor will let your parallelism work without causing SparkSession errors. For industrial-scale data loads, native Spark parallelism is preferable.