
Unable to Use VectorAssembler in PySpark 3.5.0 Due to Whitelisting

Ritchie
New Contributor

Hi,
I am currently using PySpark version 3.5.0 on my Databricks cluster. Despite setting the required configuration using the command: spark.conf.set("spark.databricks.ml.whitelist", "true"), I am still encountering an issue while trying to use the VectorAssembler module from PySpark MLlib.

When I try to import it using the statement "from pyspark.ml.feature import VectorAssembler", I receive the following error:

Py4JError: An error occurred while calling None.org.apache.spark.ml.feature.VectorAssembler.
py4j.security.Py4JSecurityException: Constructor public org.apache.spark.ml.feature.VectorAssembler(java.lang.String) is not whitelisted.

It appears that the class is not whitelisted despite enabling the necessary configuration. Kindly assist in resolving this issue so that I can proceed with my Spark MLlib operations.

6 REPLIES

Alberto_Umana
Databricks Employee

Hi @Ritchie,

Can you run the following and confirm that it outputs true?

print(spark.conf.get("spark.databricks.ml.whitelist"))
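If the key has never been set in the session, a bare spark.conf.get(key) can raise instead of printing anything. A minimal sketch of a safer check, assuming a notebook where the usual spark session object exists; read_conf_flag and describe_flag are hypothetical helper names, not part of any Databricks API:

```python
def read_conf_flag(spark, key, default="<unset>"):
    """Return the session config value for `key`, or `default` if it is not set."""
    # spark.conf.get accepts a default value, so an unset key does not raise.
    return spark.conf.get(key, default)


def describe_flag(spark, key="spark.databricks.ml.whitelist"):
    """Format the flag as 'key = value' so it can be pasted into a reply."""
    return f"{key} = {read_conf_flag(spark, key)}"


# In a notebook (assumption: `spark` is the active SparkSession):
# print(describe_flag(spark))
```

Note that this only reports what the session config holds. As the replies in this thread show, a value of "true" here did not by itself make the exception go away.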

Niels80
New Contributor II

Any updates on this?

print(spark.conf.get("spark.databricks.ml.whitelist")) yields "true" after setting it, but I still get:
 
py4j.security.Py4JSecurityException: Constructor public org.apache.spark.ml.feature.VectorAssembler(java.lang.String) is not whitelisted.
 
 

koji_kawamura
Databricks Employee

Hi all, I ran the code today out of curiosity and it worked without any exception. I used a single-node cluster with DBR 15.4 ML (Spark 3.5.0).

Here is my code to confirm:

from pyspark.sql import Row
from pyspark.sql.functions import lit
from pyspark.ml.feature import VectorAssembler

# Create sample data
data = [
    Row(feature1=1.0, feature2=2.0),
    Row(feature1=3.0, feature2=4.0),
]

# Create DataFrame
df = spark.createDataFrame(data)

df = df.withColumn("pcainput_valgbp_avg_y1", lit(0))

# Initialize VectorAssembler
assembler = VectorAssembler(
    inputCols=["feature1", "feature2", "pcainput_valgbp_avg_y1"],
    outputCol="features",
)

# Transform the DataFrame
output_df = assembler.transform(df)
display(output_df)


 

I submitted too quickly. There is a similar thread mentioning that the same error can be thrown on a Shared cluster.

https://community.databricks.com/t5/data-engineering/constructor-public-org-apache-spark-ml-feature-...
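To tell this failure apart from other errors, one option is to wrap the constructor call and look for the text of the exception quoted in this thread. This is a rough sketch only: is_whitelist_error and try_construct are hypothetical helpers, and matching on the message text is an assumption about how the error surfaces in Python, not a documented contract.

```python
def is_whitelist_error(exc):
    """True if the exception text looks like the Py4J security error quoted in this thread."""
    text = str(exc)
    return "Py4JSecurityException" in text or "is not whitelisted" in text


def try_construct(factory):
    """Call factory(); return (obj, None) on success or (None, hint) on a whitelist error."""
    try:
        return factory(), None
    except Exception as exc:
        if is_whitelist_error(exc):
            hint = (
                "Constructor blocked by the Py4J security check. "
                "Check the cluster type: this thread reports the error on a Shared "
                "cluster and success on a single-node ML runtime cluster."
            )
            return None, hint
        # Anything else is a different problem, so let it propagate unchanged.
        raise


# In a notebook (assumption: pyspark is available on the cluster):
# from pyspark.ml.feature import VectorAssembler
# assembler, hint = try_construct(
#     lambda: VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
# )
# print(hint or "VectorAssembler constructed without a security exception")
```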

Yes, you can close this topic. It works on an ML-enabled machine. I wasn't aware that you have to pay extra for ML capabilities.

koji_kawamura
Databricks Employee

Glad to hear it works for you now! The ML runtime has a variety of preinstalled integrations, such as MLflow, which provides ML lifecycle management, MLOps, etc. Please explore them if you haven't already, to get the full benefit of the extra cost 😉
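For anyone unsure which runtime a notebook is attached to, here is a small sketch that inspects the runtime version string. It rests on two assumptions that may not hold everywhere: that the cluster tag spark.databricks.clusterUsageTags.sparkVersion is readable from the session, and that ML runtime strings contain "-ml-" (for example 15.4.x-cpu-ml-scala2.12). Both function names are hypothetical.

```python
def looks_like_ml_runtime(version_string):
    """Heuristic: ML runtime version strings are assumed to contain '-ml-'."""
    return "-ml-" in version_string


def cluster_runtime_string(spark, key="spark.databricks.clusterUsageTags.sparkVersion"):
    """Read the runtime version tag, returning '' if it is not set (assumed key name)."""
    return spark.conf.get(key, "")


# In a notebook (assumption: `spark` is the active SparkSession):
# version = cluster_runtime_string(spark)
# print(version, "-> ML runtime" if looks_like_ml_runtime(version) else "-> not an ML runtime")
```

If the tag cannot be read on a given cluster, the cluster configuration page in the workspace UI shows the same runtime version.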
