Unit Testing with the new Databricks Connect in Python

cosminsanda
New Contributor III

I would like to create a regular PySpark session in an isolated environment against which I can run my Spark-based tests. I don't see how that's possible with the new Databricks Connect. I'm going in circles here; is it even possible?

I don't want to connect to some cluster or anywhere, really. I want to be able to run my tests as usual, without internet access.


4 REPLIES

-werners-
Esteemed Contributor III

I would not use Databricks Connect/Spark Connect in that case.
Instead, run Spark locally. Of course, you will not have Databricks-specific tools (like dbutils, etc.).
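
For illustration, a minimal sketch of that local setup, assuming pytest and a plain pyspark installation (the fixture and test names are illustrative):

    import pytest
    from pyspark.sql import SparkSession

    @pytest.fixture(scope="session")
    def spark():
        # local[2] runs Spark inside the test process itself,
        # so no cluster access or internet connection is needed.
        session = (
            SparkSession.builder
            .master("local[2]")
            .appName("unit-tests")
            .getOrCreate()
        )
        yield session
        session.stop()

    def test_upper_column(spark):
        df = spark.createDataFrame([("a",)], ["letter"])
        assert df.selectExpr("upper(letter)").first()[0] == "A"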

cosminsanda
New Contributor III

The problem is that I don't see how you can have both native Spark and Databricks Connect (Spark Connect) installed at the same time. The guidelines suggest one or the other, which is a bit of a pickle.

-werners-
Esteemed Contributor III

You could try to separate the environments, e.g. using containers/VMs.
There are probably other ways too, but these immediately came to mind.

cosminsanda
New Contributor III
ACCEPTED SOLUTION

OK, so the best solution as it stands today (for me personally, at least) is this:

  • Have pyspark ^3.4 installed with the connect extra (i.e. pip install "pyspark[connect]").
  • My unit tests then don't have to change at all, as they use a regular Spark session created on the fly.
  • For running the script locally while taking advantage of Databricks, I use the open-source Spark Connect and set SPARK_REMOTE=sc://${WORKSPACE_INSTANCE_NAME}:443/;token=${PERSONAL_ACCESS_TOKEN};x-databricks-cluster-id=${CLUSTER_ID}. A sketch of how both modes can share one entry point follows below.
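
For reference, a minimal sketch of that dual setup, assuming pyspark >= 3.4 installed as pyspark[connect]; the helper name is illustrative, and the ${...} values are placeholders for your own workspace instance, token, and cluster ID:

    import os
    from pyspark.sql import SparkSession

    def get_session() -> SparkSession:
        # When SPARK_REMOTE is exported, e.g.
        #   SPARK_REMOTE="sc://${WORKSPACE_INSTANCE_NAME}:443/;token=${PERSONAL_ACCESS_TOKEN};x-databricks-cluster-id=${CLUSTER_ID}"
        # pyspark[connect] builds a Spark Connect session against Databricks.
        if os.environ.get("SPARK_REMOTE"):
            return SparkSession.builder.getOrCreate()
        # Otherwise fall back to a plain local session, which is what the
        # unit tests exercise; no cluster or internet access required.
        return SparkSession.builder.master("local[2]").getOrCreate()

The tests never set SPARK_REMOTE, so they always get the local session; exporting the variable sends the same code to the Databricks cluster instead.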