Re: Installing Databricks Connect breaks pyspark l...

beliz · ‎02-14-2025

I use databricks-connect for local IDE-based (PyCharm) development of Databricks jobs. The new databricks-connect with DatabricksSession caused me a lot of trouble, since I had to maintain two separate import paths: one for local development and one for job execution on Databricks. And this messy solution was suggested by Databricks here.

I think I found a workaround for this issue. I haven't tested it thoroughly, but basic functionality seems to work.

With this I can create and use a SparkSession from my IDE, connected to a remote Databricks cluster running DBR 15.4 (with Apache Spark 3.5.0).

My solution is as follows:

I removed the databricks-connect package and installed pyspark 3.5.0 instead. To access my remote Databricks cluster I use Spark Connect and its SPARK_REMOTE environment variable, which looks something like this:

SPARK_REMOTE=sc://blahlbah.cloud.databricks.com:443/;token=mytoken;x-databricks-cluster-id=myclusterid

I built the value of the variable based on this documentation.
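For anyone who wants to assemble that value from its parts, here is a minimal sketch. The helper name build_spark_remote and its parameters are hypothetical, not part of pyspark or any Databricks package; it only reproduces the string layout shown above, and the host, token, and cluster ID are the same placeholders.

```python
def build_spark_remote(host: str, token: str, cluster_id: str, port: int = 443) -> str:
    """Assemble a Spark Connect URL in the layout shown in the post.

    Hypothetical helper: it does plain string formatting and does not
    validate or escape its inputs.
    """
    return f"sc://{host}:{port}/;token={token};x-databricks-cluster-id={cluster_id}"


# Placeholder values taken from the example above, not real credentials.
spark_remote = build_spark_remote(
    "blahlbah.cloud.databricks.com", "mytoken", "myclusterid"
)
print(f"SPARK_REMOTE={spark_remote}")
```

The printed line can then be pasted into the environment variables of an IDE run configuration.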

After setting the environment variable for the Python script's run configuration, I can use SparkSession as before:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
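If setting the variable in the IDE is inconvenient, it may also work to set it from the script itself before the session is created. The sketch below is untested against a real cluster and assumes pyspark 3.5.0 picks up SPARK_REMOTE when getOrCreate() is called. The function names configure_spark_remote and get_spark are hypothetical.

```python
import os


def configure_spark_remote(url: str) -> None:
    """Set SPARK_REMOTE only if it is not already defined.

    An existing value (for example one set by the IDE run configuration)
    is left untouched.
    """
    os.environ.setdefault("SPARK_REMOTE", url)


def get_spark():
    """Create or reuse a SparkSession, exactly as in the snippet above.

    pyspark is imported lazily so that this module can be loaded on a
    machine where pyspark is not installed.
    """
    from pyspark.sql import SparkSession

    return SparkSession.builder.getOrCreate()


# Intended use (placeholders, needs a reachable cluster):
# configure_spark_remote("sc://blahlbah.cloud.databricks.com:443/;token=mytoken;x-databricks-cluster-id=myclusterid")
# spark = get_spark()
```

Because setdefault does not overwrite an existing value, the same script should still run unchanged wherever SPARK_REMOTE is already provided by the environment.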