Databricks Connect Scala
02-21-2024 11:15 PM - edited 02-21-2024 11:16 PM
Hi,
I'm using Databricks Connect to run Scala code from IntelliJ on a Databricks single node cluster.
Even with the simplest code, I'm experiencing this error:
org.apache.spark.SparkException: grpc_shaded.io.grpc.StatusRuntimeException: INTERNAL: org.apache.spark.sql.types.StructType; local class incompatible: stream classdesc serialVersionUID = -2957078008500330718, local class serialVersionUID = 7842785351289879144
Creating and reading DataFrames works, but as soon as I try even the simplest row-level processing (e.g. a .map), it fails.
Minimal code example to reproduce:
val df = spark.read.table("samples.nyctaxi.trips")
import spark.implicits._
df
  .map(_.getAs[Int]("dropoff_zip")) // the typed map is what triggers the error
  .show(10)
This happens with both 13.3 LTS and 14.3 LTS. The Databricks Connect dependency matches the cluster version, Scala is 2.12.15, and the JDK is Azul Zulu 8.
Same code works fine in a notebook.
02-22-2024 04:44 AM
Forgot to add that I included the code as described in the docs:
val sourceLocation = getClass.getProtectionDomain.getCodeSource.getLocation.toURI
DatabricksSession.builder()
  .clusterId(clusterId)
  .addCompiledArtifacts(sourceLocation)
  .getOrCreate()
02-22-2024 05:34 AM
Can you check your build.sbt?
https://docs.databricks.com/en/dev-tools/databricks-connect/scala/index.html
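From those docs, the build.sbt setup should look roughly like this; a minimal sketch, assuming Scala 2.12.15 and a 14.3 LTS cluster as you describe (adjust versions to your setup):

scalaVersion := "2.12.15"
// Databricks Connect client; the version should match the cluster's Databricks Runtime
libraryDependencies += "com.databricks" % "databricks-connect" % "14.3.1"

A mismatch between the client jar and the cluster runtime could produce class-incompatibility errors like the serialVersionUID one you're seeing.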
Also, in your session builder I do not see the remote() or sdkConfig() part.
Can you go through the docs and check everything?
It should work; I checked it myself last week.
02-22-2024 06:20 AM
I left that out; my connection looks like this:
val spark: SparkSession =
  DatabricksSession.builder()
    .host("xxx")
    .token("xxx")
    .clusterId("xxx")
    .addCompiledArtifacts(sourceLocation) // tried with and without this
    .getOrCreate()
02-22-2024 05:39 AM
I notice you call the addCompiledArtifacts API; that is used for UDFs packed in a jar that is installed on the cluster.
https://docs.databricks.com/en/dev-tools/databricks-connect/scala/udf.html
Is that the case for you? It seems you only want to run the default example.
02-22-2024 06:25 AM - edited 02-22-2024 06:29 AM
The documentation states: "The same mechanism described in the preceding section for UDFs also applies to typed Dataset APIs."
My map(_.getAs[Int]("dropoff_zip")) is effectively a UDF, which is why I'm adding the compiled source.
(I also had to do it in a similar way when trying Spark Connect against a Spark 3.5.0 cluster, and it ran successfully.)
By the way, as soon as I leave out the .map(), it runs, so the error has to do with user functions / the typed Dataset API.
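Putting the pieces together, here is the whole repro as one self-contained sketch (the object name and the xxx placeholders are just for illustration):

import com.databricks.connect.DatabricksSession
import org.apache.spark.sql.SparkSession

object Repro extends App {
  // Upload this project's compiled classes so the cluster can deserialize the map lambda
  val sourceLocation = getClass.getProtectionDomain.getCodeSource.getLocation.toURI

  val spark: SparkSession =
    DatabricksSession.builder()
      .host("xxx")
      .token("xxx")
      .clusterId("xxx")
      .addCompiledArtifacts(sourceLocation)
      .getOrCreate()

  import spark.implicits._ // provides the Encoder[Int] needed by the typed map

  val df = spark.read.table("samples.nyctaxi.trips")
  df.map(_.getAs[Int]("dropoff_zip")).show(10) // fails with the serialVersionUID mismatch
}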
02-22-2024 07:14 AM
I see, so it can't be the connection.
Does importing udf help? Just guessing here (after reading the docs for the typed Dataset API).
02-22-2024 08:04 AM
Using a proper UDF does indeed work:
import org.apache.spark.sql.functions.udf
import spark.implicits._ // for the $"..." column syntax

val myUdf = udf { row: Int =>
  row * 5
}
df.withColumn("dropoff_zip_processed", myUdf($"dropoff_zip"))
It's just the Dataset API that doesn't work.
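For contrast, staying fully untyped also works; a quick sketch:

// Untyped column expression: no user lambda is shipped to the cluster, so no serialization issue
df.select(($"dropoff_zip" * 5).as("dropoff_zip_processed")).show(10)

So anything that avoids shipping a compiled lambda runs fine; only the typed operators fail.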
02-22-2024 11:14 PM
So this is clearly a bug in Databricks Connect. I'm not on a support plan, so I'm not sure how to report it...
02-23-2024 12:03 AM
I also tried on a shared cluster, and there the error message is pretty clear:
org.sparkproject.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: User defined code is not yet supported.
02-23-2024 12:29 AM
That is pretty clear indeed.
But according to the docs it should be supported.
Since Scala support only went GA on February 1, 2024, chances are we are looking at a bug here.
Are you sure you added the correct Databricks Connect jar (14.3)?
02-23-2024 02:16 AM - edited 02-23-2024 02:18 AM
Yes, I tried both 14.3.0 and 14.3.1.
I'm also encountering the same (or a very similar) error when running against a local Spark Connect cluster. When I replace databricks-connect with spark-connect, it works.
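For reference, the dependency swap in build.sbt was roughly this (a sketch; the OSS Spark Connect Scala client at 3.5.0 is what I tested against the local cluster):

// Databricks Connect client (fails on the typed map):
// libraryDependencies += "com.databricks" % "databricks-connect" % "14.3.1"
// OSS Spark Connect client (works):
libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % "3.5.0"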
I sent a bug report to help@databricks.com.
02-23-2024 03:40 AM
Nice find.
Definitely a bug if it works with spark-connect.
02-23-2024 07:38 AM
I just hope Databricks will pay attention to it.