Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.
Databricks Connect Scala -

dollyb
Contributor

Hi,

I'm using Databricks Connect to run Scala code from IntelliJ on a Databricks single node cluster.

Even with the simplest code, I'm experiencing this error:

org.apache.spark.SparkException: grpc_shaded.io.grpc.StatusRuntimeException: INTERNAL: org.apache.spark.sql.types.StructType; local class incompatible: stream classdesc serialVersionUID = -2957078008500330718, local class serialVersionUID = 7842785351289879144

Creating DataFrames and reading data works, but as soon as I try even the simplest row-level processing it fails.

Minimal code example to reproduce:

val df = spark.read.table("samples.nyctaxi.trips")
import spark.implicits._
df
  .map(_.getAs[Int]("dropoff_zip"))
  .show(10)

This happens with both 13.3 LTS and 14.3 LTS. The Databricks Connect dependency matches the cluster version, Scala is 2.12.15, and the JDK is Azul Zulu 8.

Same code works fine in a notebook.

13 REPLIES

dollyb
Contributor

I forgot to add that I included the code described in the docs:

val sourceLocation = getClass.getProtectionDomain.getCodeSource.getLocation.toURI
DatabricksSession.builder()
  .clusterId(clusterId)
  .addCompiledArtifacts(sourceLocation)
  .getOrCreate()

-werners-
Esteemed Contributor III

Can you check your build.sbt?
https://docs.databricks.com/en/dev-tools/databricks-connect/scala/index.html

Also, in your session builder I do not see the remote() or sdkConfig() part.
Can you go through the docs and check everything?
It should work; I checked it myself last week.

dollyb
Contributor

I left that out; my connection looks like this:

val spark: SparkSession =
  DatabricksSession.builder()
    .host("xxx")
    .token("xxx")
    .clusterId("xxx")
    .addCompiledArtifacts(sourceLocation) // tried with and without this
    .getOrCreate()

-werners-
Esteemed Contributor III

I notice you call the addCompiledArtifacts API; that is used for UDFs packed in a JAR that is installed on the cluster.

https://docs.databricks.com/en/dev-tools/databricks-connect/scala/udf.html

Is that the case for you? It seems you only want to run the default example.

dollyb
Contributor

The documentation states: "The same mechanism described in the preceding section for UDFs also applies to typed Dataset APIs."

My

map(_.getAs[Int]("dropoff_zip"))

is essentially a UDF, so that's why I'm adding the compiled sources.

(I also had to do it in a similar way when trying Spark Connect against a Spark 3.5.0 cluster, and it ran successfully).

By the way, as soon as I leave out the .map(), it runs, so the error has to do with user functions / Dataset API.
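For what it's worth, a possible stopgap while typed Dataset operations fail would be to express the same projection with untyped Column expressions, which are resolved on the server and don't serialize client-side classes. This is only a sketch, assuming a working DatabricksSession bound to `spark`:

```scala
// Hypothetical workaround sketch: avoid the typed Dataset API entirely.
// Column expressions travel to the server as an unresolved plan, so no
// client-side class (such as StructType) needs Java serialization.
import org.apache.spark.sql.functions.col

val df = spark.read.table("samples.nyctaxi.trips")
df.select(col("dropoff_zip").cast("int"))
  .show(10)
```

This only covers simple projections, of course; anything that genuinely needs arbitrary Scala code still goes through the UDF / typed Dataset path that is failing here.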

-werners-
Esteemed Contributor III

I see, so it can't be the connection.
Does importing udf help? Just guessing here (after reading the docs for the typed Dataset API).

dollyb
Contributor

Using a proper UDF does indeed work:

import org.apache.spark.sql.functions.udf

val myUdf = udf { zip: Int =>
  zip * 5
}
df.withColumn("dropoff_zip_processed", myUdf($"dropoff_zip"))

It's just the Dataset API that doesn't work.

dollyb
Contributor

So this is clearly a bug in Databricks Connect. I'm not on a support plan, so I'm not sure how to report it...

dollyb
Contributor

I also tried on a shared cluster, and there the error message is pretty clear:

org.sparkproject.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: User defined code is not yet supported.

-werners-
Esteemed Contributor III

That is pretty clear indeed.
But according to the docs it should be supported.
Since Scala support only went GA on February 1, 2024, chances are we are talking about a bug here.

Are you sure you added the correct databricks connect jar? (14.3)

dollyb
Contributor

Yes, I tried both 14.3.0 and 14.3.1.

I'm also encountering the same (or a very similar) error when running against a local Spark Connect cluster. When I replace databricks-connect with spark-connect, it works.
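For anyone wanting to reproduce that comparison, the swap amounts to changing one dependency in build.sbt. This is a sketch; the exact version numbers are assumptions and should be matched to your cluster's DBR and Spark versions:

```scala
// build.sbt sketch: swap the Databricks Connect client for the
// open-source Spark Connect client to isolate the bug.
// Versions below are illustrative, not prescriptive.
libraryDependencies += "com.databricks" % "databricks-connect" % "14.3.1"

// ...replaced with the OSS Spark Connect JVM client:
// libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % "3.5.0"
```

With the OSS client the session is built via SparkSession.builder().remote("sc://localhost") instead of DatabricksSession.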

I sent a bug report to help@databricks.com.

-werners-
Esteemed Contributor III

Nice find.
Definitely a bug if it works in spark-connect.

dollyb
Contributor

I just hope Databricks will pay attention to it.
