Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Installing Databricks Connect breaks pyspark local cluster mode

htu
New Contributor III

Hi, it seems that installing databricks-connect also modifies pyspark so that it no longer works with a local master. Local mode has been especially useful in testing, where unit tests for Spark-related code run without any remote session.

Without databricks-connect, this code initializes a local Spark session without problems:

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[1]").getOrCreate()

However, when the databricks-connect Python package is installed, the same code fails with:

> RuntimeError: Only remote Spark sessions using Databricks Connect are supported. Could not find connection parameters to start a Spark remote session.

Question: Why does it work like this? Also, is this documented somewhere? I do not see it mentioned on the Databricks Connect Troubleshooting or Limitations documentation pages. The same issue has been raised on GitHub: Running pytest with local spark session · Issue #1152 · databricks/databricks-vscode.
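For readers who hit the same error in a test suite, here is a minimal sketch of a guard that checks for the databricks-connect distribution before building a local session, so the failure is explicit rather than surfacing from inside pyspark. The helper names databricks_connect_installed and local_spark are hypothetical and not part of any library; the only assumption about the environment is that the distribution is named "databricks-connect", as in the post above.

```python
from importlib import metadata


def databricks_connect_installed() -> bool:
    """Return True if the databricks-connect distribution is installed."""
    try:
        metadata.version("databricks-connect")
    except metadata.PackageNotFoundError:
        return False
    return True


def local_spark(master: str = "local[1]"):
    """Build a local SparkSession, refusing early if databricks-connect is present.

    Hypothetical helper: it only makes the conflict described in this thread
    explicit, it does not work around it.
    """
    if databricks_connect_installed():
        raise RuntimeError(
            "databricks-connect is installed in this environment; "
            "local master mode is not available here."
        )
    # Imported lazily so the check above works even without pyspark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.master(master).getOrCreate()
```

In a pytest suite, the same check could be used to skip local-mode tests instead of raising.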

9 REPLIES

htu
New Contributor III

Hi, I understand what Databricks Connect is used for (that's why I'm trying it out), but I would also like to be able to run tests. What do you mean by "different local mode"?

As a side topic, I tried running pytest tests with a Databricks Connect session (against both a spark-connect server running in a container at sc://localhost and Azure Databricks via DatabricksSession). Some of the tests fail with "Windows fatal exception: access violation" in both cases, so that doesn't really work either.

dmytro
New Contributor III

Could you please provide an example of the proposed workaround? Nothing I have tried helps; I always get the same error as in the original post. Being unable to switch properly and natively to local mode is very frustrating, to say the least.

I look forward to hearing from you.

Kind regards, Dmytro.

This is ridiculous. It's absolutely unacceptable as "intended behavior" for a professional software package to clobber the functionality of another package just by being installed.

htu
New Contributor III

Indeed. This became more obvious when I looked at the contents of the databricks-connect wheel: it also includes a pyspark package. The pyspark inside it is about 9 MB, whereas the regular pyspark package is over 300 MB. I guess they have kept only the spark-connect client-side parts and removed the whole server side. That makes some sense, but it should not be done in a way that replaces an existing package.

I even tried connecting to a local (Docker-hosted) Spark, but it crashes on some test cases.

dpires92
New Contributor II

Hey guys. I am facing the same issue. Before installing Databricks Connect, my unit tests with pytest were working properly.

I even tried creating a newSession() and using the pytest-spark library, but neither approach worked.

I got the following error:

E RuntimeError: Only remote Spark sessions using Databricks Connect are supported. Use DatabricksSession.builder to create a remote Spark session instead.
E Refer to https://docs.databricks.com/dev-tools/databricks-connect.html on how to configure Databricks Connect.

.venv\Lib\site-packages\pyspark\sql\session.py:552: RuntimeError
===================================================================== short test summary info ======================================================================
ERROR tests/unit/test_functions.py::test_if_function_add_year_month - RuntimeError: Only remote Spark sessions using Databricks Connect are supported. Use DatabricksSession.builder to create a remote Spark session instead.
========================================================================= 1 error in 0.17s =========================================================================

Let's wait for a solution. If I have any news, I'll update the topic here. Cheers.

Kolath
New Contributor II

Also frustrated by this behavior. Databricks-connect should not replace the rest of local spark.

Is there any solution to this?

Angus-Dawson
New Contributor III

I managed to work around it in Poetry by using optional dependency groups, and then when I want to switch between Databricks Connect and local PySpark functionality I run this:

poetry install --with <group x> --without <group y> --sync 

 

lukany
New Contributor II

Hi, we are facing this issue as well, i.e. RuntimeError as reported in this comment. We use the workaround with poetry groups as suggested in this comment.

The workaround adds unnecessary and non-intuitive complexity to dependency management and creates room for errors.

Is there any plan to fix this behaviour?

sean_owen
Databricks Employee

Databricks-Connect is by design a drop-in replacement for pyspark, essentially. It transparently takes over execution of the Spark parts without change to the program, and is definitely a different 'environment' from local pyspark.

As with any situation where you need to deal with separate software environments, you'd typically have a separate venv for each, and as such could have pyspark in one and databricks-connect in another. Does that not answer this?
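The separate-venv approach can be sketched with Python's standard venv module. This is only an illustration under stated assumptions: the directory names .venv-local and .venv-connect are hypothetical, no versions are pinned, and the install step needs network access. The helper names are likewise made up for this example.

```python
import subprocess
import sys
import venv
from pathlib import Path

# Hypothetical environment directories, each receiving exactly one package.
ENVIRONMENTS = {
    ".venv-local": "pyspark",
    ".venv-connect": "databricks-connect",
}


def env_python(env_dir: Path) -> Path:
    """Return the interpreter path inside a venv (the layout differs on Windows)."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def install_command(env_dir: Path, package: str) -> list[str]:
    """Build the pip command that installs one package into one venv."""
    return [str(env_python(env_dir)), "-m", "pip", "install", package]


def create_environments(root: Path) -> None:
    """Create each venv under root and install its package (needs network)."""
    for name, package in ENVIRONMENTS.items():
        env_dir = root / name
        venv.EnvBuilder(with_pip=True).create(env_dir)
        subprocess.run(install_command(env_dir, package), check=True)
```

Calling create_environments(Path.cwd()) would build both environments side by side; unit tests would then run with the interpreter from .venv-local and remote work with the one from .venv-connect.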
