Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

databricks-connect error when executing sparkml

Troy
New Contributor II

I use databricks-connect, and Spark jobs that work with Spark DataFrames run fine. But when I trigger Spark ML code, I get errors.

For example, after executing this line from the example notebook at https://docs.databricks.com/_static/notebooks/gbt-regression.html:

pipelineModel = pipeline.fit(train)
22/11/04 09:28:15 ERROR Instrumentation: java.io.IOException: unexpected exception type
	at java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1750)
	at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1280)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
---------------------------
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
---------------------------
Caused by: java.lang.IllegalArgumentException: Illegal lambda deserialization
	at scala.runtime.LambdaDeserializer$.makeCallSite$1(LambdaDeserializer.scala:89)
	at scala.runtime.LambdaDeserializer$.deserializeLambda(LambdaDeserializer.scala:114)
	at scala.runtime.LambdaDeserialize.deserializeLambda(LambdaDeserialize.java:38)
---------------------------
py4j.protocol.Py4JJavaError: An error occurred while calling o806.fit.
: java.io.IOException: unexpected exception type
	at java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1750)
	at java.io.ObjectStreamClass.invokeReadResolve(ObjectStreamClass.java:1280)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2196)
---------------------------
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
---------------------------
Caused by: java.lang.IllegalArgumentException: Illegal lambda deserialization
	at scala.runtime.LambdaDeserializer$.makeCallSite$1(LambdaDeserializer.scala:89)
	at scala.runtime.LambdaDeserializer$.deserializeLambda(LambdaDeserializer.scala:114)
	at scala.runtime.LambdaDeserialize.deserializeLambda(LambdaDeserialize.java:38)

Does anyone know how to fix it?

8 REPLIES

Kaniz_Fatma
Community Manager

Hi @Wooram Choi​, which DBR version are you using?

Troy
New Contributor II

Hi @Kaniz Fatma​, I am using 10.4.

matt_chan
New Contributor III

I'm encountering the exact same problem, also with databricks-connect 10.4.12. Our models in the production pipeline run fine because they are run through the Databricks UI, not databricks-connect. However, in our CI test pipeline they are run via databricks-connect inside Docker containers (using Concourse CI). The codebase is the same. When I run the same code manually on my local machine, connected to our cluster via databricks-connect, I hit the same problem as Troy.

In fact, I tried to run a very minimal random forest classifier and I STILL run into the same problem. Here is the code I use:

import numpy as np
import pandas as pd
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.session import SparkSession


spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame(
    pd.DataFrame({
        "feature_a": np.random.random(100),
        "feature_b": np.random.random(100),
        "feature_c": np.random.random(100),
        "label": np.random.choice([0, 1], 100),
    })
)
vector_assembler = VectorAssembler(
    inputCols=[f"feature_{n}" for n in ["a", "b", "c"]],
    outputCol="features",
)
parsed_data = (
    vector_assembler
    .transform(data)
    .drop(*[f"feature_{n}" for n in ["a", "b", "c"]])
)
model = RandomForestClassifier()
model.fit(parsed_data)
# Error thrown here, very similar to Troy's.

I'm attaching my error output as well.

matt_chan
New Contributor III

@Kaniz Fatma​ any pointers at all?

Oliver_Floyd
Contributor

Hello,

Same problem here in France.

@Kaniz Fatma​ Can we have some answers?

Kaniz_Fatma
Community Manager

Hi @Matt Chan​, @oliv vier​, and @Troy Holland​, did you get a chance to review the Databricks Connect limitations? If not, please take a look:

Databricks Connect does not support the following Databricks features and third-party platforms:

  • Unity Catalog.
  • Structured Streaming.
  • Running arbitrary code that is not part of a Spark job on the remote cluster.
  • Native Scala, Python, and R APIs for Delta table operations (DeltaTable.forPath) are not supported. However, the SQL API (spark.sql(...)) with Delta Lake operations and the Spark API (for example, spark.read.load) on Delta tables are both supported.
  • Copy into.
  • Using SQL functions, Python or Scala UDFs, which are part of the server’s catalog. However, locally introduced Scala and Python UDFs work.
  • Apache Zeppelin 0.7.x and below.
  • Connecting to clusters with table access control.
  • Connecting to clusters with process isolation enabled (in other words, where spark.databricks.pyspark.enableProcessIsolation is set to true).
  • Delta CLONE SQL command.
  • Global temporary views.
  • Koalas.
  • CREATE TABLE table AS SELECT ... SQL commands do not always work. Instead, use spark.sql("SELECT ...").write.saveAsTable("table").
  • The following Databricks Utilities:
  • AWS Glue catalog

Good morning,

For information, the error is not related to the limitations of Databricks Connect at all.

After various tests, in my case it turned out to be necessary to update the libraries in the venv used with databricks-connect.

Here are the Python library updates I made:

  • databricks-connect from 10.4.12 to 10.4.21
  • databricks-cli from 0.17.3 to 0.17.4
  • mlflow from 1.26.1 to 2.2.1
  • protobuf from 3.20.0 to 3.20.3

Note that I work with a 10.4 LTS cluster.

After these updates, the code example above works fine in IntelliJ with databricks-connect.
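To confirm that a venv actually picked up these versions, a quick standard-library check can help. This is a minimal sketch; `installed_versions` is a hypothetical helper written for this thread, not part of databricks-connect:

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for each requested distribution."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None  # not installed in this venv
    return versions


# Packages mentioned in this thread; None means the package is missing.
print(installed_versions(
    ["databricks-connect", "databricks-cli", "mlflow", "protobuf"]
))
```

Run this inside the venv that databricks-connect uses (e.g. the one your IDE or CI container activates), since the same package can have different versions in different environments.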

Oliver_Floyd
Contributor

For information, upgrading the Python libraries does not resolve all problems.

This code works fine in a notebook on Databricks:

import mlflow
model = mlflow.spark.load_model('runs:/cb6ff62587a0404cabeadd47e4c9408a/model')

However, it fails in IntelliJ with databricks-connect.

Does anyone have a solution?
