Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Installing Databricks Connect breaks pyspark local cluster mode

New Contributor III

Hi, it seems that installing databricks-connect also modifies pyspark so that it no longer works with a local master node. Local mode has been especially useful in testing, for running unit tests on Spark-related code without any remote session.

Without databricks-connect this code works fine to initialize local spark session:

spark = SparkSession.Builder().master("local[1]").getOrCreate()

However, when the databricks-connect Python package is installed, that same code fails with:

> RuntimeError: Only remote Spark sessions using Databricks Connect are supported. Could not find connection parameters to start a Spark remote session.

Question: Why does it work like this? Also, is this documented somewhere? I do not see it mentioned on the Databricks Connect Troubleshooting or Limitations documentation pages. The same issue has been raised in Running pytest with local spark session · Issue #1152 · databricks/databricks-vscode · GitHub.
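A quick way to confirm which of the two packages is active in a given environment is to ask the installed package metadata. The sketch below uses only the standard library. The helper names `installed_version` and `local_spark_available` are made up for illustration, and the rule it encodes (local mode is usable only when `pyspark` is installed and `databricks-connect` is not) is simply the behaviour described in this thread, not something taken from Databricks documentation.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def local_spark_available() -> bool:
    """Assumption from this thread: local[...] masters only work when
    plain pyspark is installed and databricks-connect is not."""
    has_pyspark = installed_version("pyspark") is not None
    has_connect = installed_version("databricks-connect") is not None
    return has_pyspark and not has_connect


if __name__ == "__main__":
    print("pyspark:", installed_version("pyspark"))
    print("databricks-connect:", installed_version("databricks-connect"))
    print("local mode expected to work:", local_spark_available())
```

Calling `local_spark_available()` at the top of a test session would at least turn the RuntimeError above into an explicit, early message about the environment.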


New Contributor III

Hi, I understand what Databricks Connect is used for (that's why I'm trying it out), but I would also like to be able to run tests. What do you mean by "different local mode"?

As a side topic, I tried running pytest tests with a Databricks Connect session (both a spark-connect server running in a container at sc://localhost and Azure Databricks via DatabricksSession), and some of the tests fail with "Windows fatal exception: access violation" in both cases, so that doesn't really work either.

New Contributor III

Could you please provide an example of the proposed workaround? Nothing I have tried helps; there is always the same error as in the original post. This is very frustrating, to say the least: there is no way to switch properly and natively to local mode.

I look forward to hearing from you.

Kind regards, Dmytro.

This is ridiculous. It's absolutely unacceptable as "intended behavior" for a professional software package to clobber the functionality of another package just by being installed.

New Contributor III

Indeed. This became more obvious when I looked at the contents of the databricks-connect wheel package: it also includes a pyspark package. The pyspark inside it is about 9 MB, whereas the regular pyspark package is over 300 MB. I guess they have kept only the Spark Connect client-side parts and removed the whole server side. That kind of makes sense, but it should not be done in a way that replaces an existing package.

I even tried connecting to a local (Docker-hosted) Spark, but it crashes on some test cases.

New Contributor II

Hey guys, I am facing the same issue. Before installing Databricks Connect, the unit tests with pytest were working properly.

I even tried creating a newSession() and using the pytest-spark library, but neither approach worked.

I got the following error " 

E RuntimeError: Only remote Spark sessions using Databricks Connect are supported. Use DatabricksSession.builder to create a remote Spark session instead.
E Refer to on how to configure Databricks Connect.

.venv\Lib\site-packages\pyspark\sql\ RuntimeError
===================================================================== short test summary info ======================================================================
ERROR tests/unit/ - RuntimeError: Only remote Spark sessions using Databricks Connect are supported. Use DatabricksSession.builder to create a remote Spark session instead.
========================================================================= 1 error in 0.17s ========================================================================= "

Let's wait for a solution. If I have any news, I will update the topic here. Cheers.

New Contributor II

Also frustrated by this behavior. Databricks-connect should not replace the rest of local spark.

Is there any solution to this?

New Contributor III

I managed to work around it in Poetry by using optional dependency groups, and then when I want to switch between Databricks Connect and local PySpark functionality I run this:

poetry install --with <group x> --without <group y> --sync 


New Contributor II

Hi, we are facing this issue as well, i.e. RuntimeError as reported in this comment. We use the workaround with poetry groups as suggested in this comment.

The workaround introduces unnecessary and non-intuitive complexity to dependency management and creates room for errors.

Is there any plan to fix this behaviour?

Databricks Employee

Databricks-Connect is by design a drop-in replacement for pyspark, essentially. It transparently takes over execution of the Spark parts without change to the program, and is definitely a different 'environment' from local pyspark.

As with any situation where you need to deal with separate software environments, you'd typically have a separate venv for each, and as such could have pyspark in one and databricks-connect in another. Does that not answer this?

New Contributor III

Users should be able to have a single Python environment setup with a single set of Python dependencies specified (in pyproject.toml or similar) and installed, and alternately point their code at either a local or remote Spark cluster simply by changing the URL they pass to `DatabricksSession.builder.remote(...)`.

As far as I can tell, this is not possible. Databricks Connect does not work with a local, open source Spark Connect server (i.e. what you get when you run `sbin/`). And open source Spark Connect does not work with a remote Databricks cluster. And as others have pointed out, installing `databricks-connect` and `pyspark` side-by-side yields a broken Python environment.

No one wants to have a Databricks cluster running 24/7 just so they can run their tests quickly. And having multiple Python environments just to handle these incompatibilities means being forced to abandon modern Python packaging tooling like Poetry and going back to manually wrangling venvs.

Is there a design reason Databricks cannot simply enable Databricks Connect to work with open source Spark Connect servers?

Databricks Employee

I don't know the details but it's not quite the same 'connect'

But I think you can simply have two venvs if you want - one with Connect and one with local pyspark. You probably do want to treat these as distinct environments; they are different environments. This is what virtual environments are there for IMHO.
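For anyone who wants to try the two-venv route, here is a minimal sketch using only the Python standard library. The directory names `.venv-local` and `.venv-connect` and all function names are hypothetical, chosen for illustration. Package versions are deliberately left unpinned and would need to match your cluster in practice.

```python
import os
import subprocess
import venv
from pathlib import Path

# Hypothetical environment directories, each holding exactly one Spark package.
ENVIRONMENTS = {
    ".venv-local": "pyspark",
    ".venv-connect": "databricks-connect",
}


def venv_python(env_dir: Path) -> Path:
    """Path to the interpreter inside a venv (the layout differs on Windows)."""
    if os.name == "nt":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def pip_install_command(env_dir: Path, package: str) -> list:
    """Build the pip command that installs `package` into the given venv."""
    return [str(venv_python(env_dir)), "-m", "pip", "install", package]


def build_environments(root: Path) -> None:
    """Create both venvs under `root` and install one Spark package into each."""
    for name, package in ENVIRONMENTS.items():
        env_dir = root / name
        venv.create(env_dir, with_pip=True)
        subprocess.run(pip_install_command(env_dir, package), check=True)


if __name__ == "__main__":
    build_environments(Path.cwd())
```

Tests would then be run with the interpreter from `.venv-local`, and remote work with the one from `.venv-connect`. This keeps the two pyspark copies from overwriting each other, but it does not address the packaging concerns raised elsewhere in this thread.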

New Contributor III

I don't see how this works with modern Python packaging tooling and standards.

How would your `pyproject.toml` look to support these multiple environments? How would modern build tools (like Poetry, Pipenv, Hatch, etc.) build/publish your project? How would someone build continuous integration testing?

What I'm understanding is that with databricks-connect you have to basically abandon modern Python packaging if you want to be able to run tests locally. You need to manually maintain multiple `requirements.txt` files and matching venvs, and manually switch between one and the other depending on whether your target is a local Spark cluster or remote one. As you switch back and forth, you'll probably need to futz with your IDE's config so that type and lint checks don't break. And to package your application for deployment, you'll have to build an sdist using a hand-rolled script; no `poetry build` for you.

This all feels very kludgy and outdated. There should be a better way, one that works naturally with modern Python packaging standards.

New Contributor II

How can I configure the environment in pytest.ini?

I need a local Spark session for unit testing, but I encountered the following error: RuntimeError: Only remote Spark sessions using Databricks Connect are supported. 

Databricks Employee

If you want to run Spark locally, you simply use pyspark!
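On the pytest question above, here is a minimal sketch of a session factory that test code could call. It assumes the tests run in an environment where the matching package is installed (plain pyspark for local mode, databricks-connect for remote mode). The environment variable `SPARK_TEST_MODE` and the function names are hypothetical, not part of pytest, pyspark, or Databricks Connect. Imports are done lazily so that only the package for the selected mode is ever loaded.

```python
import os


def spark_mode(environ=None) -> str:
    """Read the hypothetical SPARK_TEST_MODE variable; defaults to 'local'."""
    env = os.environ if environ is None else environ
    mode = env.get("SPARK_TEST_MODE", "local").strip().lower()
    if mode not in {"local", "connect"}:
        raise ValueError(f"SPARK_TEST_MODE must be 'local' or 'connect', got {mode!r}")
    return mode


def create_spark(environ=None):
    """Create a Spark session for the selected mode.

    'local' needs plain pyspark installed; 'connect' needs databricks-connect
    installed and its connection parameters already configured.
    """
    if spark_mode(environ) == "local":
        from pyspark.sql import SparkSession  # lazy import: plain pyspark only

        return SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()

    from databricks.connect import DatabricksSession  # lazy import: databricks-connect only

    return DatabricksSession.builder.getOrCreate()
```

In a `conftest.py`, `create_spark()` could be wrapped in a session-scoped pytest fixture. As far as I know, pytest.ini has no built-in setting for environment variables, so the variable would be set in the shell or the CI job. This only chooses which builder to call; it does not make the two packages coexist in one environment.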
