10-22-2024 09:36 AM - edited 10-22-2024 09:39 AM
Hello!
I'm trying to use the foreachBatch method of a Spark Streaming DataFrame with databricks-connect. Given that Spark Connect support was added to `foreachBatch` in Spark 3.5.0, I was expecting this to work.
Configuration:
- DBR 15.4 (Spark 3.5.0)
- databricks-connect 15.4.2
Code:
from databricks.connect import DatabricksSession

# Setup
spark = DatabricksSession.builder.clusterId("0501-011833-vcux5w7j").getOrCreate()

# Execute
df = spark.readStream.table("brz_stock_prices_job")

def update_metrics(batch_df, batch_id):
    size = batch_df.count()
    print(f"Batch size: {size}")

# This call is what triggers the error below
query = df.writeStream.foreachBatch(update_metrics).start()
Error:
File "/Users/osoucy/miniconda3/envs/lac/lib/python3.10/site-packages/pyspark/sql/connect/client/core.py", line 2149, in _handle_rpc_error
raise convert_exception(
pyspark.errors.exceptions.connect.SparkConnectGrpcException: (java.io.IOException)
Connection reset by peer
JVM stacktrace:
java.io.IOException
at sun.nio.ch.FileDispatcherImpl.read0(FileDispatcherImpl.java:-2)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:208)
at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.spark.api.python.StreamingPythonRunner.init(StreamingPythonRunner.scala:206)
at org.apache.spark.sql.connect.planner.StreamingForeachBatchHelper$.$anonfun$pythonForeachBatchWrapper$3(StreamingForeachBatchHelper.scala:146)
[...]
at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.run(ExecuteThreadRunner.scala:561)
Any help would be appreciated!
10-31-2024 07:53 PM
It turned out that my local machine had a different Python version than the cluster workers. Updating Python solved the issue. I just had to look in the driver log; the error message was very clear:
pyspark.errors.exceptions.base.PySparkRuntimeError: [PYTHON_VERSION_MISMATCH] Python in worker has different version: 3.11 than that in driver: 3.10, PySpark cannot run with different minor versions.
Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
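For anyone hitting the same thing: since the local interpreter acts as the Spark Connect client, its minor Python version has to match the cluster's. A minimal sanity check you can run before starting a stream might look like the sketch below (the hard-coded "3.11" for DBR 15.x is an assumption based on the runtime release notes; confirm it for your cluster):

```python
import sys

# Assumed Python version shipped with the cluster's DBR runtime
# (DBR 15.x ships Python 3.11; verify against your runtime's release notes).
CLUSTER_PYTHON = "3.11"

# Minor version of the local interpreter running databricks-connect.
local = f"{sys.version_info.major}.{sys.version_info.minor}"

if local != CLUSTER_PYTHON:
    print(f"Mismatch: local Python {local} vs cluster {CLUSTER_PYTHON}; "
          "foreachBatch will fail with PYTHON_VERSION_MISMATCH.")
else:
    print(f"Local Python {local} matches the cluster runtime.")
```

Failing fast like this beats debugging an opaque "Connection reset by peer" from the gRPC layer.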
10-31-2024 10:10 AM
Is this by any chance submitted to an UC enabled assigned cluster?
10-31-2024 07:52 PM
It's a single user UC enabled cluster.
11-01-2024 01:18 AM
Thanks for sharing the solution! Just curious, was the original error message reported in this post in the Driver log as well?
11-01-2024 11:00 PM
From what I can remember, I think it was!