Databricks Connect Scala -

dollyb
New Contributor III

Hi,

I'm using Databricks Connect to run Scala code from IntelliJ on a Databricks single node cluster.

Even with the simplest code, I'm experiencing this error:

org.apache.spark.SparkException: grpc_shaded.io.grpc.StatusRuntimeException: INTERNAL: org.apache.spark.sql.types.StructType; local class incompatible: stream classdesc serialVersionUID = -2957078008500330718, local class serialVersionUID = 7842785351289879144

Creating and reading DataFrames works, but as soon as I try even the simplest processing, it fails.

Minimal code example to reproduce:

val df = spark.read.table("samples.nyctaxi.trips")
import spark.implicits._

df
  .map(_.getAs[Int]("dropoff_zip"))
  .show(10)

This happens with both 13.3 LTS and 14.3 LTS. The Databricks Connect dependency has the same version as the cluster; Scala is 2.12.15, JDK is Azul 8.

The same code works fine in a notebook.

13 REPLIES

dollyb
New Contributor III

I forgot to add that I included this code, as described in the docs:

val sourceLocation = getClass.getProtectionDomain.getCodeSource.getLocation.toURI
DatabricksSession.builder()
  .clusterId(clusterId)
  .addCompiledArtifacts(sourceLocation)
  .getOrCreate()

-werners-
Esteemed Contributor III

Can you check your build.sbt?
https://docs.databricks.com/en/dev-tools/databricks-connect/scala/index.html

Also, in your session builder I do not see the remote() or sdkConfig() part.
Can you go through the docs and check everything?
It should work; I checked it myself last week.
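
For reference, a matching build.sbt would look roughly like this (a sketch; the databricks-connect version must match the cluster's Databricks Runtime, 14.3.1 is just an example):

// build.sbt sketch -- versions are illustrative and must match the cluster
scalaVersion := "2.12.15"

// Databricks Connect client for Scala; use the release matching your DBR
libraryDependencies += "com.databricks" % "databricks-connect" % "14.3.1"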

dollyb
New Contributor III

I left that out; my connection looks like this:

val spark: SparkSession =
  DatabricksSession.builder()
    .host("xxx")
    .token("xxx")
    .clusterId("xxx")
    .addCompiledArtifacts(sourceLocation) // tried with and without this
    .getOrCreate()
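
(For completeness, the docs also show a profile-based variant that avoids hard-coding credentials; a sketch, assuming the sdkConfig()/DatabricksConfig API from the Scala docs and a DEFAULT profile in ~/.databrickscfg:)

import com.databricks.connect.DatabricksSession
import com.databricks.sdk.core.DatabricksConfig

// read host and token from the DEFAULT profile instead of hard-coding them
val config = new DatabricksConfig().setProfile("DEFAULT")
val spark = DatabricksSession.builder()
  .sdkConfig(config)
  .clusterId("xxx")
  .getOrCreate()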

-werners-
Esteemed Contributor III

I notice you call the addCompiledArtifacts API; that is used for UDFs packaged in a JAR that is installed on the cluster.

https://docs.databricks.com/en/dev-tools/databricks-connect/scala/udf.html

Is that the case for you? It seems you only want to run the default example.

dollyb
New Contributor III

The documentation states: "The same mechanism described in the preceding section for UDFs also applies to typed Dataset APIs."

My

map(_.getAs[Int]("dropoff_zip"))

is like a UDF, which is why I'm adding the compiled artifacts.

(I also had to do this when trying Spark Connect against a plain Spark 3.5.0 cluster, and it ran successfully.)

By the way, as soon as I leave out the .map(), it runs, so the error has to do with user-defined functions / the typed Dataset API.
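
For comparison, the equivalent with built-in column expressions should run fine, since no user-defined closure has to be serialized and shipped to the cluster (a sketch using the standard functions API):

// column-based version of the same transformation; only built-in
// expressions, so no compiled user code is involved
import org.apache.spark.sql.functions.col

df
  .select(col("dropoff_zip").cast("int"))
  .show(10)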

-werners-
Esteemed Contributor III

I see, so it can't be the connection.
Does importing udf help? Just guessing here (after reading the docs for the typed Dataset API).

dollyb
New Contributor III

Using a proper UDF does indeed work:

import org.apache.spark.sql.functions.udf // $"..." also needs spark.implicits._

val myUdf = udf { row: Int =>
  row * 5
}
df.withColumn("dropoff_zip_processed", myUdf($"dropoff_zip"))

It's just the Dataset API that doesn't work.

dollyb
New Contributor III

So this is clearly a bug in Databricks Connect. I'm not on a support plan, so I'm not sure how to report it...

dollyb
New Contributor III

I also tried on a shared cluster, and there the error message is pretty clear:

org.sparkproject.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: User defined code is not yet supported.

-werners-
Esteemed Contributor III

That is pretty clear indeed.
But according to the docs it should be supported.
Since Scala support only went GA on February 1, 2024, chances are we are talking about a bug here.

Are you sure you added the correct Databricks Connect JAR (14.3)?

dollyb
New Contributor III

Yes, I tried both 14.3.0 and 14.3.1.

I'm also encountering the same (or a very similar) error when running against a local Spark Connect cluster. When I replace databricks-connect with spark-connect, it works.

I sent a bug report to help@databricks.com.
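
For anyone who wants to reproduce the comparison, the dependency swap looks roughly like this (a sketch; spark-connect-client-jvm is the OSS client artifact on Maven Central, and its version must match the local server):

// build.sbt sketch -- swap the Databricks Connect client for the OSS one
// libraryDependencies += "com.databricks" % "databricks-connect" % "14.3.1"
libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % "3.5.0"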

-werners-
Esteemed Contributor III

Nice find.
Definitely a bug if it works with spark-connect.

dollyb
New Contributor III

I just hope Databricks will pay attention to it.