Error in Spark Streaming with foreachBatch and Dat...

TWib · ‎05-12-2024

The following code throws an error locally in my IDE with Databricks-connect.

from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS sample")
spark.sql("DROP TABLE IF EXISTS sample.mvp")
spark.sql("DROP TABLE IF EXISTS sample.mvp_from_foreach_batch")

data = [("John", "Doe", 30), ("Jane", "Doe", 25), ("Mike", "Johnson", 35)]
df = spark.createDataFrame(data, ["FirstName", "LastName", "Age"])
df.write.format("delta").mode("overwrite").saveAsTable("sample.mvp")

def foreach_batch_function(df, epoch_id):
    df.write.format("delta").mode("overwrite").saveAsTable(
        "sample.mvp_from_foreach_batch"
    )

spark.readStream.table("sample.mvp").writeStream.foreachBatch(
    foreach_batch_function
).outputMode("append").trigger(availableNow=True).start().awaitTermination()

This code only works in notebooks or directly on a cluster. It will not run locally in an IDE with Databricks Connect.

Instead error

pyspark.errors.exceptions.connect.SparkException: No PYTHON_UID found for session (some uid) is raised

In general, Databricks Connect works fine for all other cases.

My local environment:

databricks-connect 14.3.1
databricks-sdk 0.26.0
pyspark 3.5.1
Python 3.11.4

Cluster running on

15.1 (includes Apache Spark 3.5.0, Scala 2.12)
Single User Mode

Error in Spark Streaming with foreachBatch and Databricks Connect