matt_chan
New Contributor III

I'm encountering the exact same problem. I'm also using databricks-connect 10.4.12. The models in our production pipeline are doing fine because they are run through the Databricks UI, not databricks-connect. However, in our CI test pipeline they are run using databricks-connect inside Docker containers (via Concourse CI). The codebase is the same. When I try to run the same code manually on my local machine, connected to our cluster via databricks-connect, I run into the same problem as Troy.
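
For context, the containers pick up the connection details from the standard databricks-connect config file (written by `databricks-connect configure`), and the Spark session is obtained the usual way. A rough sketch of that setup follows; the `spark.range` check at the end is only an illustration, not part of our actual pipeline:

from pyspark.sql.session import SparkSession

# databricks-connect reads host, token, cluster_id, etc. from ~/.databricks-connect,
# so no explicit configuration is needed in the code itself.
spark = SparkSession.builder.getOrCreate()

# Simple connectivity check (illustrative only): runs a trivial job on the cluster.
print(spark.range(10).count())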

In fact, I tried to run a very minimal random forest classifier and I STILL run into the same problem. Here is the code I use:

import numpy as np
import pandas as pd
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.session import SparkSession
 
 
spark = SparkSession.builder.getOrCreate()
# Build a toy dataset of three random features and a random binary label.
data = spark.createDataFrame(
    pd.DataFrame({
        "feature_a": np.random.random(100),
        "feature_b": np.random.random(100),
        "feature_c": np.random.random(100),
        "label": np.random.choice([0, 1], 100),
    })
)
# Assemble the feature columns into a single vector column and drop the raw columns.
vector_assembler = VectorAssembler(
    inputCols=[f"feature_{n}" for n in ["a", "b", "c"]],
    outputCol="features",
)
parsed_data = (
    vector_assembler
    .transform(data)
    .drop(*[f"feature_{n}" for n in ["a", "b", "c"]])
)
model = RandomForestClassifier()
model.fit(parsed_data)
# Error thrown here, very similar to Troy's.

I'm attaching my error output as well.