04-16-2024 01:48 PM
I would like to create a regular PySpark session in an isolated environment against which I can run my Spark-based tests. I don't see how that's possible with the new Databricks Connect. I'm going in circles here; is it even possible?
I don't want to connect to a cluster or anywhere else, really. I want to be able to run my tests as usual, without internet access.
Accepted Solutions
04-18-2024 01:16 AM
OK, so the best solution as it stands today (for me personally, at least) is this:
- Install pyspark ^3.4 with the `connect` extra.
- My unit tests then don't have to change at all, as they use a regular Spark session created on the fly.
- For running the script locally while taking advantage of Databricks, I use open source Spark Connect and set `SPARK_REMOTE=sc://${WORKSPACE_INSTANCE_NAME}:443/;token=${PERSONAL_ACCESS_TOKEN};x-databricks-cluster-id=${CLUSTER_ID}` (see the sketch after this list).
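
For reference, a minimal sketch of how that dual setup can be wired up. The `get_spark` helper and its branching on `SPARK_REMOTE` are illustrative, not from the original post; they assume pyspark 3.4+ installed with the `connect` extra.

```python
import os

from pyspark.sql import SparkSession


def get_spark() -> SparkSession:
    """Return a Spark session.

    If SPARK_REMOTE is set (e.g. sc://<workspace-host>:443/;token=...;x-databricks-cluster-id=...),
    a Spark Connect session is created against that endpoint. Otherwise a plain
    local session is created, which is what the unit tests use -- no cluster and
    no internet access required.
    """
    remote = os.environ.get("SPARK_REMOTE")
    if remote:
        # Spark Connect client session pointed at the remote endpoint.
        return SparkSession.builder.remote(remote).getOrCreate()
    # Local, in-process Spark for offline unit tests.
    return (
        SparkSession.builder
        .master("local[*]")
        .appName("local-tests")
        .getOrCreate()
    )
```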
04-17-2024 12:28 AM
I would not use Databricks Connect/Spark Connect in that case.
Instead, run Spark locally. Of course, you will not have Databricks-specific tools (like dbutils, etc.).
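
As a minimal sketch of that approach (the fixture and test below are illustrative, assuming pytest and a plain pyspark install):

```python
import pytest
from pyspark.sql import SparkSession
from pyspark.sql import functions as F


@pytest.fixture(scope="session")
def spark():
    # Plain local PySpark: runs entirely offline, no Databricks cluster needed.
    session = (
        SparkSession.builder
        .master("local[1]")
        .appName("unit-tests")
        .getOrCreate()
    )
    yield session
    session.stop()


def test_uppercase(spark):
    df = spark.createDataFrame([("a",), ("b",)], ["letter"])
    result = df.select(F.upper("letter").alias("letter")).collect()
    assert [row.letter for row in result] == ["A", "B"]
```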
04-17-2024 01:38 AM
The problem is that I don't see how you can have both native Spark and Databricks Connect (Spark Connect) installed. The guidelines suggest one or the other, which is a bit of a pickle.
04-17-2024 03:18 AM
You could try to separate the environments, e.g. using containers/VMs.
There are probably other ways too, but these immediately came to mind.
05-27-2024 11:51 PM
Given that this doesn't work on serverless compute, aren't those tests very slow to complete due to the compute startup time? I'm trying to steer away from Databricks Connect for unit testing for this reason. If it supported serverless, that would be a different story.

