Unable to Use VectorAssembler in PySpark 3.5.0 Due to Whitelisting
01-20-2025 02:35 AM
Hi,
I am currently using PySpark 3.5.0 on my Databricks cluster. Even after setting the configuration with spark.conf.set("spark.databricks.ml.whitelist", "true"), I still cannot use VectorAssembler from PySpark MLlib.
When I import it with the statement "from pyspark.ml.feature import VectorAssembler" and instantiate it, I receive the following error:
Py4JError: An error occurred while calling None.org.apache.spark.ml.feature.VectorAssembler.
py4j.security.Py4JSecurityException: Constructor public org.apache.spark.ml.feature.VectorAssembler(java.lang.String) is not whitelisted.
It appears that the class is not whitelisted despite enabling the necessary configuration. Kindly assist in resolving this issue so that I can proceed with my Spark MLlib operations.
01-20-2025 04:31 AM
Hi @Ritchie,
Can you run the following and confirm that it prints true?
print(spark.conf.get("spark.databricks.ml.whitelist"))
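If the key was never set on the cluster, calling spark.conf.get with only the key typically raises an error instead of returning a value, so a slightly more defensive check may help. This is a minimal sketch, assuming a Databricks notebook where `spark` is already defined; the helper names `read_flag` and `is_enabled` and the "<unset>" placeholder are made up for illustration.

```python
def read_flag(conf, key, default="<unset>"):
    """Return the config value for `key`, or `default` if the key is not set.

    `conf` is any object with a get(key, default) method, such as spark.conf.
    """
    return conf.get(key, default)


def is_enabled(value):
    """True only if the value reads as the string 'true' (case-insensitive)."""
    return str(value).strip().lower() == "true"


# In a notebook (assumes `spark` exists):
# value = read_flag(spark.conf, "spark.databricks.ml.whitelist")
# print(value, is_enabled(value))
```

Note that a true value here only confirms the setting was stored in the session; as the rest of this thread shows, it does not by itself mean the constructor call will be allowed.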
03-04-2025 04:11 AM - edited 03-04-2025 04:14 AM
Any updates on this?
03-05-2025 02:06 AM
Hi all, I ran the code today out of curiosity and it worked without any exception. I used a single-node cluster with DBR 15.4 ML (Spark 3.5.0).
Here is my code to confirm:
from pyspark.sql import Row
from pyspark.sql.functions import lit
from pyspark.ml.feature import VectorAssembler
# Create sample data
data = [
    Row(feature1=1.0, feature2=2.0),
    Row(feature1=3.0, feature2=4.0)
]
# Create DataFrame
df = spark.createDataFrame(data)
df = df.withColumn("pcainput_valgbp_avg_y1", lit(0))
# Initialize VectorAssembler
assembler = VectorAssembler(
    inputCols=["feature1", "feature2", "pcainput_valgbp_avg_y1"],
    outputCol="features"
)
# Transform the DataFrame
output_df = assembler.transform(df)
display(output_df)
03-05-2025 02:09 AM
I submitted too fast. A similar thread mentions that the same error can be thrown on a Shared cluster.
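To tell this particular failure apart from other Py4J errors, one option is to look for the markers seen in the error text from the original post. This is a rough sketch only: the helper name `looks_like_whitelist_error` is hypothetical, and the check is a plain substring match on the quoted message, not an official API.

```python
def looks_like_whitelist_error(message):
    """Heuristic: does the error text match the security exception from this thread?"""
    return "Py4JSecurityException" in message and "is not whitelisted" in message


# In a notebook, it could wrap the constructor call (assumes pyspark is available):
# try:
#     assembler = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features")
# except Exception as exc:
#     if looks_like_whitelist_error(str(exc)):
#         print("Blocked by the Py4J whitelist; check the cluster access mode and runtime.")
#     else:
#         raise
```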
03-05-2025 02:11 AM
Yes, you can close this topic. It works on an ML-enabled machine. I wasn't aware that you have to pay extra for ML capabilities.
03-06-2025 04:36 PM
Glad to hear it works for you now! The ML runtime has a variety of preinstalled integrations, such as MLflow, which provides ML lifecycle management, MLOps, etc. Please explore them if you haven't already, to get the most out of the extra cost 😉

