Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Unit Testing with the new Databricks Connect in Python

cosminsanda
New Contributor III

I would like to create a regular PySpark session in an isolated environment against which I can run my Spark-based tests. I don't see how that's possible with the new Databricks Connect. I'm going in circles here; is it even possible?

I don't want to connect to a cluster or anywhere else, really. I want to be able to run my tests as usual, without internet access.


5 REPLIES

-werners-
Esteemed Contributor III

I would not use Databricks Connect/Spark Connect in that case.
Instead, run Spark locally. Of course you will not have Databricks-specific tools (like dbutils etc.).
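
A minimal sketch of this local-only setup, assuming pytest and a locally installed pyspark (the file layout and fixture name are illustrative, not from the thread):

# conftest.py - a session-scoped fixture that starts a local Spark session,
# so tests run in-process with no cluster and no internet access.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    session = (
        SparkSession.builder
        .master("local[*]")    # run Spark locally, using all available cores
        .appName("unit-tests")
        .getOrCreate()
    )
    yield session
    session.stop()

# test_example.py - an ordinary PySpark unit test using the fixture.
def test_row_count(spark):
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    assert df.count() == 2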

cosminsanda
New Contributor III

The problem is that I don't see how you can have both native PySpark and Databricks Connect (Spark Connect) installed at the same time. The guidelines suggest one or the other, which is a bit of a pickle.

-werners-
Esteemed Contributor III

You could try to separate the environments, e.g. using containers/VMs.
There are probably other ways too, but these immediately came to mind.

cosminsanda
New Contributor III

OK, so the best solution as it stands today (for me personally, at least) is this (see the sketch after this list):

  • Have pyspark ^3.4 installed with the connect extra.
  • My unit tests then don't have to change at all, as they use a regular Spark session created on the fly.
  • For running the script locally while taking advantage of Databricks, I use the open-source Spark Connect and set SPARK_REMOTE=sc://${WORKSPACE_INSTANCE_NAME}:443/;token=${PERSONAL_ACCESS_TOKEN};x-databricks-cluster-id=${CLUSTER_ID}
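
A minimal sketch of how that can look in code, assuming pyspark[connect] >= 3.4 is installed and that WORKSPACE_INSTANCE_NAME, PERSONAL_ACCESS_TOKEN and CLUSTER_ID are exported in the environment (the get_spark helper is illustrative, not from the thread):

# session.py - one entry point for both cases.
# With SPARK_REMOTE unset, unit tests get a plain local Spark session.
# With SPARK_REMOTE set to the sc://... Databricks URL above, pyspark 3.4+
# (installed with the connect extra) returns a Spark Connect session instead.
import os
from pyspark.sql import SparkSession

def get_spark() -> SparkSession:
    if os.environ.get("SPARK_REMOTE"):
        # Spark Connect: the builder picks up SPARK_REMOTE automatically.
        return SparkSession.builder.getOrCreate()
    # Local fallback for tests: no cluster, no internet.
    return (
        SparkSession.builder
        .master("local[*]")
        .appName("unit-tests")
        .getOrCreate()
    )

Installing with pip install "pyspark[connect]>=3.4" pulls in the Spark Connect client dependencies, and leaving SPARK_REMOTE unset keeps pytest runs fully local.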

thibault
Contributor II

Given this doesn't work on serverless compute, aren't those tests very slow to complete because of the compute startup time? I'm trying to steer away from Databricks Connect for unit testing for this reason. If it supported serverless, that would be a different story.
