Databricks Connect Scala
02-21-2024 11:15 PM - edited 02-21-2024 11:16 PM
Hi,
I'm using Databricks Connect to run Scala code from IntelliJ on a Databricks single node cluster.
Even with the simplest code, I'm experiencing this error:
org.apache.spark.SparkException: grpc_shaded.io.grpc.StatusRuntimeException: INTERNAL: org.apache.spark.sql.types.StructType; local class incompatible: stream classdesc serialVersionUID = -2957078008500330718, local class serialVersionUID = 7842785351289879144
Creating and reading DataFrames works, but as soon as I try even the simplest row-level processing (e.g. a .map), it fails.
Minimal code example to reproduce:
val df = spark.read.table("samples.nyctaxi.trips")
import spark.implicits._
df
  .map(_.getAs[Int]("dropoff_zip")) // the typed map is what triggers the error
  .show(10)
This happens with both 13.3 LTS and 14.3 LTS. The Databricks Connect dependency matches the cluster version, Scala is 2.12.15, and the JDK is Azul Zulu 8.
Same code works fine in a notebook.
02-22-2024 04:44 AM
Forgot to add that I included the code as described in the docs:
val sourceLocation = getClass.getProtectionDomain.getCodeSource.getLocation.toURI
DatabricksSession.builder()
  .clusterId(clusterId)
  .addCompiledArtifacts(sourceLocation)
  .getOrCreate()
02-22-2024 05:34 AM
Can you check your build.sbt?
https://docs.databricks.com/en/dev-tools/databricks-connect/scala/index.html
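From those docs, the build.sbt setup should look roughly like this; a minimal sketch, assuming Scala 2.12.15 and a 14.3 LTS cluster as you describe (adjust versions to your setup):

scalaVersion := "2.12.15"
// Databricks Connect client; the version should match the cluster's Databricks Runtime
libraryDependencies += "com.databricks" % "databricks-connect" % "14.3.1"

A mismatch between the client jar and the cluster runtime could produce class-incompatibility errors like the serialVersionUID one you're seeing.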
Also, in your session builder I do not see the remote() or sdkConfig() part.
Can you go through the docs and check everything?
It should work; I checked it myself last week.
02-22-2024 06:20 AM
I left that out; my connection looks like this:
val spark: SparkSession =
  DatabricksSession.builder()
    .host("xxx")
    .token("xxx")
    .clusterId("xxx")
    .addCompiledArtifacts(sourceLocation) // tried with and without this
    .getOrCreate()
02-22-2024 05:39 AM
I notice you call the addCompiledArtifacts API; that is used for UDFs packed in a jar that is installed on the cluster.
https://docs.databricks.com/en/dev-tools/databricks-connect/scala/udf.html
Is that the case for you? It seems you only want to run the default example.
02-22-2024 06:25 AM - edited 02-22-2024 06:29 AM
The documentation states: "The same mechanism described in the preceding section for UDFs also applies to typed Dataset APIs."
My map(_.getAs[Int]("dropoff_zip")) is effectively a UDF, which is why I'm adding the compiled source.
(I also had to do it in a similar way when trying Spark Connect against a Spark 3.5.0 cluster, and it ran successfully.)
By the way, as soon as I leave out the .map(), it runs, so the error has to do with user functions / the typed Dataset API.
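Putting the pieces together, here is the whole repro as one self-contained sketch (the object name and the xxx placeholders are just for illustration):

import com.databricks.connect.DatabricksSession
import org.apache.spark.sql.SparkSession

object Repro extends App {
  // Upload this project's compiled classes so the cluster can deserialize the map lambda
  val sourceLocation = getClass.getProtectionDomain.getCodeSource.getLocation.toURI

  val spark: SparkSession =
    DatabricksSession.builder()
      .host("xxx")
      .token("xxx")
      .clusterId("xxx")
      .addCompiledArtifacts(sourceLocation)
      .getOrCreate()

  import spark.implicits._ // provides the Encoder[Int] needed by the typed map

  val df = spark.read.table("samples.nyctaxi.trips")
  df.map(_.getAs[Int]("dropoff_zip")).show(10) // fails with the serialVersionUID mismatch
}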
02-22-2024 07:14 AM
I see, so it can't be the connection.
Does importing udf help? Just guessing here (after reading the docs for the typed Dataset API).
02-22-2024 08:04 AM
Using a proper UDF does indeed work:
import org.apache.spark.sql.functions.udf
import spark.implicits._ // for the $"..." column syntax

val myUdf = udf { row: Int =>
  row * 5
}
df.withColumn("dropoff_zip_processed", myUdf($"dropoff_zip"))
It's just the Dataset API that doesn't work.
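For contrast, staying fully untyped also works; a quick sketch:

// Untyped column expression: no user lambda is shipped to the cluster, so no serialization issue
df.select(($"dropoff_zip" * 5).as("dropoff_zip_processed")).show(10)

So anything that avoids shipping a compiled lambda runs fine; only the typed operators fail.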
02-22-2024 11:14 PM
So this is clearly a bug in Databricks Connect. I'm not on a support plan, so I'm not sure how to report it...
02-23-2024 12:03 AM
I also tried on a shared cluster, and there the error message is pretty clear:
org.sparkproject.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: User defined code is not yet supported.
02-23-2024 12:29 AM
That is pretty clear indeed.
But according to the docs it should be supported.
Since Scala support only went GA on February 1, 2024, chances are we are looking at a bug here.
Are you sure you added the correct Databricks Connect jar (14.3)?
02-23-2024 02:16 AM - edited 02-23-2024 02:18 AM
Yes, I tried both 14.3.0 and 14.3.1.
I'm also encountering the same (or a very similar) error when running against a local Spark Connect cluster. When I replace databricks-connect with spark-connect, it works.
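For reference, the dependency swap in build.sbt was roughly this (a sketch; the OSS Spark Connect Scala client at 3.5.0 is what I tested against the local cluster):

// Databricks Connect client (fails on the typed map):
// libraryDependencies += "com.databricks" % "databricks-connect" % "14.3.1"
// OSS Spark Connect client (works):
libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % "3.5.0"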
I sent a bug report to help@databricks.com.
02-23-2024 03:40 AM
Nice find.
Definitely a bug if it works with spark-connect.
02-23-2024 07:38 AM
I just hope Databricks will pay attention to it.