11-26-2022 03:59 PM
I'm encountering the exact same problem, and I'm also using databricks-connect 10.4.12. Our models in the production pipeline run fine because they are launched through the Databricks UI, not databricks-connect. However, in our testing CI pipeline they are run via databricks-connect inside Docker containers (using Concourse-CI). The codebase is the same in both cases. When I try to run the same code manually on my local machine, connected to our cluster via databricks-connect, I run into the same problem as Troy.
In fact, I tried to run a very minimal random forest classifier and I STILL run into the same problem. Here is the code I use:
import numpy as np
import pandas as pd
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy dataset: three random features and a binary label
data = spark.createDataFrame(
    pd.DataFrame({
        "feature_a": np.random.random(100),
        "feature_b": np.random.random(100),
        "feature_c": np.random.random(100),
        "label": np.random.choice([0, 1], 100),
    })
)

# Assemble the feature columns into a single "features" vector column
vector_assembler = VectorAssembler(
    inputCols=[f"feature_{n}" for n in ["a", "b", "c"]],
    outputCol="features",
)
parsed_data = (
    vector_assembler
    .transform(data)
    .drop(*[f"feature_{n}" for n in ["a", "b", "c"]])
)

classifier = RandomForestClassifier()
model = classifier.fit(parsed_data)
# The error is thrown here, very similar to Troy's.

I'm attaching my error output as well.
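For what it's worth, the data itself checks out before it ever reaches Spark: what VectorAssembler produces here is just a column-wise stack of the three feature columns. A quick pandas/numpy-only sanity check of the same toy data (a hypothetical sketch that does not go through databricks-connect at all) runs without issue, which suggests the problem is in the databricks-connect execution path rather than the data or the model setup:

```python
import numpy as np
import pandas as pd

# Same toy data as above, in plain pandas
df = pd.DataFrame({
    "feature_a": np.random.random(100),
    "feature_b": np.random.random(100),
    "feature_c": np.random.random(100),
    "label": np.random.choice([0, 1], 100),
})

# The equivalent of the VectorAssembler step: one (100, 3) feature matrix
features = df[[f"feature_{n}" for n in ["a", "b", "c"]]].to_numpy()
labels = df["label"].to_numpy()

print(features.shape)  # (100, 3)
```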