02-14-2025 08:45 AM - edited 02-14-2025 08:48 AM
I use databricks-connect for local IDE-based (PyCharm) development of Databricks jobs. The new databricks-connect with DatabricksSession caused me a lot of trouble, since I needed to maintain two separate import systems: one for local development and one for job execution on Databricks. This messy solution was the one suggested by Databricks here.
I think I found a workaround for this issue. I haven't tested it deeply, but basic functionality seems to be working.
With it I can create and use a SparkSession from my IDE, connected to a remote Databricks cluster running DBR 15.4 (with Apache Spark 3.5.0).
My solution is as follows:
I've removed the databricks-connect package and installed pyspark 3.5.0 instead. To access my remote Databricks cluster I use Spark Connect and its SPARK_REMOTE env variable, which looks something like this:
SPARK_REMOTE=sc://blahlbah.cloud.databricks.com:443/;token=mytoken;x-databricks-cluster-id=myclusterid
I built the value of the variable based on this documentation.
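If you would rather set the variable from Python than in the IDE run configuration, a minimal sketch could look like the one below. The helper name `build_spark_remote` is my own invention (it is not part of pyspark or databricks-connect), and the host, token and cluster id are the same placeholders as above. It only assembles the string in the format shown and assumes the variable is set before the SparkSession is created.

```python
import os


def build_spark_remote(host: str, token: str, cluster_id: str, port: int = 443) -> str:
    """Assemble a Spark Connect connection string in the format shown above.

    Hypothetical helper for illustration; not a pyspark or Databricks API.
    """
    return f"sc://{host}:{port}/;token={token};x-databricks-cluster-id={cluster_id}"


# Placeholder values only; in practice read the token from a secret store
# or another env variable instead of hard-coding it.
os.environ["SPARK_REMOTE"] = build_spark_remote(
    host="blahlbah.cloud.databricks.com",
    token="mytoken",
    cluster_id="myclusterid",
)
```

Setting it in the run configuration keeps the token out of the source code, so I would treat the Python version as a convenience for quick experiments only.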
After setting the env var for the Python script's execution, I can use SparkSession as before:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()