topic Re: Spark Structured Streaming foreachBatch with databricks-connect in Data Engineering

Spark Structured Streaming foreachBatch with databricks-connect

olivier-soucy — Tue, 22 Oct 2024 16:39:18 GMT

Hello!

I'm trying to use the foreachBatch method of a Spark Streaming DataFrame with databricks-connect. Given that spark connect supported was added to `foreachBatch` in 3.5.0, I was expecting this to work.

Configuration:
- DBR 15.4 (Spark 3.5.0)
- databricks-connect 15.4.2

Code:

import os from databricks.connect import DatabricksSession # Setup spark = DatabricksSession.builder.clusterId("0501-011833-vcux5w7j").getOrCreate() # Execute df = spark.readStream.table("brz_stock_prices_job") def update_metrics(batch_df, batch_id): size = batch_df.count() print(f"Batch size: {size}")

Error:

File "/Users/osoucy/miniconda3/envs/lac/lib/python3.10/site-packages/pyspark/sql/connect/client/core.py", line 2149, in _handle_rpc_error raise convert_exception( pyspark.errors.exceptions.connect.SparkConnectGrpcException: (java.io.IOException) Connection reset by peer JVM stacktrace: java.io.IOException at sun.nio.ch.FileDispatcherImpl.read0(FileDispatcherImpl.java:-2) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:197) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:208) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read(BufferedInputStream.java:265) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.spark.api.python.StreamingPythonRunner.init(StreamingPythonRunner.scala:206) at org.apache.spark.sql.connect.planner.StreamingForeachBatchHelper$.$anonfun$pythonForeachBatchWrapper$3(StreamingForeachBatchHelper.scala:146) [...] at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.run(ExecuteThreadRunner.scala:561)

Any help would be appreciated!

Re: Spark Structured Streaming foreachBatch with databricks-connect

VZLA — Thu, 31 Oct 2024 17:10:07 GMT

Is this by any chance submitted to an UC enabled assigned cluster?

Re: Spark Structured Streaming foreachBatch with databricks-connect

olivier-soucy — Fri, 01 Nov 2024 02:52:06 GMT

It's a single user UC enabled cluster.

Re: Spark Structured Streaming foreachBatch with databricks-connect

olivier-soucy — Fri, 01 Nov 2024 02:53:57 GMT

It turned out that my local machine had a different python version when compared to the workers. Updating python solved this issue. I simply had to look into the driver log, the error message was very obvious:

pyspark.errors.exceptions.base.PySparkRuntimeError: [PYTHON_VERSION_MISMATCH] Python in worker has different version: 3.11 than that in driver: 3.10, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

Re: Spark Structured Streaming foreachBatch with databricks-connect

VZLA — Fri, 01 Nov 2024 08:18:34 GMT

Thanks for sharing the solution! Just curious, was the original error message reported in this post in the Driver log as well?

Re: Spark Structured Streaming foreachBatch with databricks-connect

olivier-soucy — Sat, 02 Nov 2024 06:00:18 GMT

From I can remember, I think it was!