Databricks Community

Ritchie · ‎01-20-2025

Hi,
I am currently using PySpark version 3.5.0 on my Databricks cluster. Despite setting the required configuration with spark.conf.set("spark.databricks.ml.whitelist", "true"), I still run into an error when trying to use the VectorAssembler class from PySpark MLlib.

When I try to import it using the statement "from pyspark.ml.feature import VectorAssembler", I receive the following error:

Py4JError: An error occurred while calling None.org.apache.spark.ml.feature.VectorAssembler.
py4j.security.Py4JSecurityException: Constructor public org.apache.spark.ml.feature.VectorAssembler(java.lang.String) is not whitelisted.

It appears that the class is not whitelisted despite enabling the necessary configuration. Kindly assist in resolving this issue so that I can proceed with my Spark MLlib operations.

Alberto_Umana · ‎01-20-2025

Hi @Ritchie,

Can you run the following and confirm that it prints true?

print(spark.conf.get("spark.databricks.ml.whitelist"))

Niels80 · ‎03-04-2025

Any updates on this?

print(spark.conf.get("spark.databricks.ml.whitelist")) yields "true" after setting it.

py4j.security.Py4JSecurityException: Constructor public org.apache.spark.ml.feature.VectorAssembler(java.lang.String) is not whitelisted.

koji_kawamura · ‎03-05-2025

Hi all, I ran the code today out of curiosity and it worked without any exception. I used a single-node cluster with DBR 15.4 ML (Spark 3.5.0).

Here is my code to confirm:

from pyspark.sql import Row
from pyspark.sql.functions import lit
from pyspark.ml.feature import VectorAssembler

# Create sample data
data = [
  Row(feature1=1.0, feature2=2.0),
  Row(feature1=3.0, feature2=4.0)
]

# Create DataFrame
df = spark.createDataFrame(data)

df = df.withColumn("pcainput_valgbp_avg_y1", lit(0))

# Initialize VectorAssembler
assembler = VectorAssembler(
  inputCols=["feature1", "feature2", "pcainput_valgbp_avg_y1"],
  outputCol="features"
)

# Transform the DataFrame
output_df = assembler.transform(df)
display(output_df)

koji_kawamura · ‎03-05-2025

I submitted too fast. There is a similar thread mentioning that the same error can be thrown on a Shared cluster.

https://community.databricks.com/t5/data-engineering/constructor-public-org-apache-spark-ml-feature-...
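
In case it helps anyone landing here, below is a minimal sketch for checking whether a given cluster blocks a constructor with this whitelist error or fails for some other reason. The helper names is_whitelist_error and probe are made up for illustration and are not part of PySpark or Databricks, and matching on the message text is an assumption based on the error quoted in this thread.

```python
def is_whitelist_error(exc: BaseException) -> bool:
    """Return True if the exception text looks like the Py4J whitelist error quoted above."""
    text = f"{type(exc).__name__}: {exc}"
    # Assumption: the message contains both markers, as in the traceback in this thread.
    return "Py4JSecurityException" in text and "is not whitelisted" in text


def probe(factory) -> str:
    """Call factory() once and summarize the outcome as a short status string."""
    try:
        factory()
    except Exception as exc:  # broad on purpose: the concrete py4j class may vary
        if is_whitelist_error(exc):
            return "blocked by whitelist"
        return "failed for another reason"
    return "ok"
```

On a cluster you would pass in the constructor call you want to test, for example probe(lambda: VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")) after importing VectorAssembler. If it reports "blocked by whitelist", the cluster type is the thing to look at rather than the Spark config.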

Niels80 · ‎03-05-2025

Yes, you can close this topic. It works on an ML-enabled machine. I wasn't aware that you have to pay extra for ML capabilities.

koji_kawamura · ‎03-06-2025

Glad to hear it works for you now! The ML runtime has a variety of preinstalled integrations, such as MLflow, which provides ML lifecycle management, MLOps, etc. Please explore them if you haven't already, to get the most out of the extra cost 😉