03-22-2024 05:38 AM - edited 03-22-2024 05:39 AM
Hello,
When developing locally with Databricks Connect, how do I re-establish the SparkSession after the cluster has restarted? getOrCreate() seems to return the old, invalid SparkSession even after a cluster restart instead of creating a new one. Or am I missing something?
Before the cluster restart, everything works fine:
>> spark = DatabricksSession.builder.getOrCreate()
DEBUG:databricks.connect:IPython module is present.
DEBUG:databricks.connect:Falling back to default configuration from the SDK.
INFO:databricks.sdk:loading DEFAULT profile from ~/.databrickscfg: host, token, cluster_id
DEBUG:databricks.sdk:Attempting to configure auth: pat
DEBUG:databricks.connect:Creating SparkSession from SDK config: <Config: host=https://adb-**************.**.azuredatabricks.net, token=***, auth_type=pat, cluster_id=****-******-********>
DEBUG:databricks.connect:Validating configuration by using the Databricks SDK
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): adb-**************.**.azuredatabricks.net:443
DEBUG:urllib3.connectionpool:https://adb-*******************.**.azuredatabricks.net:443 "GET /api/2.0/clusters/get?cluster_id=****-******-******** HTTP/1.1" 200 None
DEBUG:databricks.sdk:GET /api/2.0/clusters/get?cluster_id=****-******-********
< 200 OK
< {
<<< REDACTED: long message with api response >>>
< }
DEBUG:databricks.connect:Session validated successfully.
>> spark.sql("SELECT now()")
Out[7]: DataFrame[now(): timestamp]
After the cluster restart:
>> spark = DatabricksSession.builder.getOrCreate()
DEBUG:databricks.connect:IPython module is present.
DEBUG:databricks.connect:Falling back to default configuration from the SDK.
INFO:databricks.sdk:loading DEFAULT profile from ~/.databrickscfg: host, token, cluster_id
DEBUG:databricks.sdk:Attempting to configure auth: pat
>> spark.sql("SELECT now()")
Traceback (most recent call last):
File "C:\***\lib\site-packages\IPython\core\interactiveshell.py", line 3508, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-9-4c2039c39977>", line 1, in <module>
spark.sql("SELECT now()")
File "C:\***\lib\site-packages\pyspark\sql\connect\session.py", line 572, in sql
data, properties = self.client.execute_command(cmd.command(self._client))
File "C:\***\lib\site-packages\pyspark\sql\connect\client\core.py", line 1139, in execute_command
data, _, _, _, properties = self._execute_and_fetch(req, observations or {})
File "C:\***\lib\site-packages\pyspark\sql\connect\client\core.py", line 1515, in _execute_and_fetch
for response in self._execute_and_fetch_as_iterator(req, observations):
File "C:\***\lib\site-packages\pyspark\sql\connect\client\core.py", line 1493, in _execute_and_fetch_as_iterator
self._handle_error(error)
File "C:\***\lib\site-packages\pyspark\sql\connect\client\core.py", line 1805, in _handle_error
raise error
File "C:\***\lib\site-packages\pyspark\sql\connect\client\core.py", line 1486, in _execute_and_fetch_as_iterator
yield from handle_response(b)
File "C:\***\lib\site-packages\pyspark\sql\connect\client\core.py", line 1406, in handle_response
self._verify_response_integrity(b)
File "C:\***\lib\site-packages\pyspark\sql\connect\client\core.py", line 1937, in _verify_response_integrity
raise PySparkAssertionError(
pyspark.errors.exceptions.base.PySparkAssertionError: Received incorrect server side session identifier for request. Please create a new Spark Session to reconnect. (5601ab48-a7cf-40c6-b59c-460381c816a6 != 8282a8c4-13cd-4fda-906e-2b1d8bec2115)
Shouldn't getOrCreate() recognize that it has to create a new session? Am I doing something wrong? How do I force the creation of a new session? I cannot use spark.stop(), since that leads to the same error.
I am using databricks-connect 14.3.1 and Python 3.10.12.
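For reference, here is the kind of workaround I am experimenting with in the meantime: catch the session-mismatch error and build a brand-new session instead of reusing the cached one. This is only a sketch; the sql_with_reconnect helper is my own, and I am assuming the builder exposes a create() method (as PySpark's Spark Connect builder does) that always builds a fresh session:

from databricks.connect import DatabricksSession
from pyspark.errors import PySparkAssertionError

def sql_with_reconnect(spark, query):
    # Run a query; if the server-side session id no longer matches
    # (e.g. after a cluster restart), build a new session and retry once.
    try:
        return spark.sql(query), spark
    except PySparkAssertionError:
        # Assumption: builder.create() bypasses the cached session that
        # getOrCreate() keeps returning and always creates a new one.
        spark = DatabricksSession.builder.create()
        return spark.sql(query), spark

df, spark = sql_with_reconnect(spark, "SELECT now()")

Keeping the returned session means later queries go through the fresh connection instead of the stale one.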
03-26-2024 07:56 AM - edited 03-26-2024 07:59 AM
Thank you for your reply, @Retired_mod. But there is no issue with the availability of databricks-connect. I had a bit of time to look into it and found that the issue does not occur with databricks-connect 13.3 and a cluster on Runtime 13.3; it does occur with databricks-connect 14.3 and a cluster on Runtime 14.3.
databricks-connect-13.3 and Runtime 13.3 Cluster:
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.profile("DEBUGGING_133").getOrCreate()
spark.sql("SELECT 1")
# output: DataFrame[1: int]
# >>> Databricks cluster shuts down (e.g. auto-termination due to inactivity during a long-running local script)
spark.sql("SELECT 1")
# Cluster starts again automatically
# output: DataFrame[1: int]
databricks-connect-14.3 and Runtime 14.3 Cluster:
from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.profile("DEBUGGING_133").getOrCreate()
spark.sql("SELECT 1")
# output: DataFrame[1: int]
# >>> Databricks cluster shuts down (e.g. auto-termination due to inactivity during a long-running local script)
spark.sql("SELECT 1")
# Cluster starts again automatically
# output:
Traceback (most recent call last):
File "****\lib\site-packages\IPython\core\interactiveshell.py", line 3508, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-7-e8eb9b165388>", line 1, in <module>
spark.sql("SELECT 1")
File "****\lib\site-packages\pyspark\sql\connect\session.py", line 572, in sql
data, properties = self.client.execute_command(cmd.command(self._client))
File "****\lib\site-packages\pyspark\sql\connect\client\core.py", line 1139, in execute_command
data, _, _, _, properties = self._execute_and_fetch(req, observations or {})
File "****\lib\site-packages\pyspark\sql\connect\client\core.py", line 1515, in _execute_and_fetch
for response in self._execute_and_fetch_as_iterator(req, observations):
File "****\lib\site-packages\pyspark\sql\connect\client\core.py", line 1493, in _execute_and_fetch_as_iterator
self._handle_error(error)
File "****\lib\site-packages\pyspark\sql\connect\client\core.py", line 1805, in _handle_error
raise error
File "****\lib\site-packages\pyspark\sql\connect\client\core.py", line 1486, in _execute_and_fetch_as_iterator
yield from handle_response(b)
File "****\lib\site-packages\pyspark\sql\connect\client\core.py", line 1406, in handle_response
self._verify_response_integrity(b)
File "****\lib\site-packages\pyspark\sql\connect\client\core.py", line 1937, in _verify_response_integrity
raise PySparkAssertionError(
pyspark.errors.exceptions.base.PySparkAssertionError: Received incorrect server side session identifier for request. Please create a new Spark Session to reconnect. (ab413162-708a-423f-84c7-b04969ed3bf4 != 3c8ea3e4-e20f-4a31-82a0-ff938f4017c6)
Is this maybe a bug? Where can I see known issues or report this?
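Until this is clarified, the pragmatic fallback on my side is to stay on the combination that works above, i.e. pinning the client to the 13.3 line against a Runtime 13.3 cluster:

pip install "databricks-connect==13.3.*"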
05-17-2024 11:59 PM
If anyone encounters this problem, the solution that worked for me was to restart the Jupyter kernel.
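If you would rather do that from code than through the notebook UI, the following is a minimal sketch using IPython's kernel API (behavior can vary slightly across ipykernel versions). After the restart, the next cell starts from a clean interpreter and therefore builds a completely new SparkSession:

# Restart the running Jupyter kernel programmatically; the next executed
# cell starts in a fresh interpreter, so getOrCreate() builds a new session.
import IPython
IPython.Application.instance().kernel.do_shutdown(restart=True)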