topic Can't use pyspark bucketizer in Machine Learning

Can't use pyspark bucketizer

wise_centipede — Wed, 01 Oct 2025 17:25:12 GMT

As the title suggests, I am struggling to use the pyspark Bucketizer; I repeatedly get the following error:

File <command-8301298062763331>, line 4
      2 from pyspark.ml.feature import Bucketizer
      3 spark = SparkSession.builder.appName("test").getOrCreate()
----> 4 bucketizer = Bucketizer()

File /databricks/python/lib/python3.12/site-packages/pyspark/ml/wrapper.py:87, in JavaWrapper._new_java_obj(java_class, *args)
     84 from pyspark.core.context import SparkContext
     86 sc = SparkContext._active_spark_context
---> 87 assert sc is not None
     89 java_obj = _jvm()
     90 for name in java_class.split("."):

Minimal reproducible example on serverless compute:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer
spark = SparkSession.builder.appName("test").getOrCreate()
bucketizer = Bucketizer()

Re: Can't use pyspark bucketizer

nayan_wylde — Wed, 01 Oct 2025 17:48:16 GMT

Can you try passing the parameters to the Bucketizer constructor? Even though the docs list them as optional, I see it works when you provide the parameters splits, inputCol, and outputCol.

from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer

# Initialize SparkSession with error handling
try:
    spark = SparkSession.builder.appName("BucketizerTest").getOrCreate()
    print(f"Spark version: {spark.version}")  # Verify SparkSession
except Exception as e:
    print(f"Failed to initialize SparkSession: {e}")
    raise

# Create a sample DataFrame
data = [(1, -0.5), (2, 0.0), (3, 1.5), (4, 3.0)]
df = spark.createDataFrame(data, ["id", "value"])

# Define splits for bucketizing
splits = [float("-inf"), 0.0, 1.0, 2.0, float("inf")]

# Initialize Bucketizer with required parameters
try:
    bucketizer = Bucketizer(
        splits=splits,
        inputCol="value",
        outputCol="bucket"
    )
    # Apply Bucketizer to DataFrame
    bucketed_df = bucketizer.transform(df)
    bucketed_df.show()
except Exception as e:
    print(f"Error with Bucketizer: {e}")
    raise

# Optional: Stop SparkSession (only if needed)
# spark.stop()
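For anyone wondering which bucket each sample value should land in once the transform runs, here is a rough pure-Python sketch of the documented Bucketizer rule: n+1 splits give n buckets, each bucket covers [x, y), and the last bucket also includes its upper bound. The helper name assign_bucket is made up for illustration and is not part of pyspark. It ignores NaN and the handleInvalid setting, and it returns plain ints where Spark writes doubles (0.0, 1.0, ...).

```python
from bisect import bisect_right

def assign_bucket(value, splits):
    """Return the bucket index for value, or None if it lies outside the splits.

    Hypothetical helper mirroring the documented rule: buckets are [x, y),
    except the last one, which also includes its upper bound.
    """
    if value < splits[0] or value > splits[-1]:
        return None
    # bisect_right counts the splits <= value; subtracting 1 gives the bucket index.
    # The min() clamp puts the top boundary into the last bucket.
    return min(bisect_right(splits, value) - 1, len(splits) - 2)

# Same splits and sample values as in the snippet above
splits = [float("-inf"), 0.0, 1.0, 2.0, float("inf")]
values = [-0.5, 0.0, 1.5, 3.0]
buckets = [assign_bucket(v, splits) for v in values]
print(buckets)  # [0, 1, 2, 3]
```

So with those splits the four rows should come out in buckets 0 through 3, which is a quick way to sanity-check the output of bucketed_df.show().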

Re: Can't use pyspark bucketizer

szymon_dybczak — Wed, 01 Oct 2025 18:10:02 GMT

Hi @wise_centipede ,

In your Serverless compute select Environment Version: 4 and it will work 🙂

With a version below 4 I got the same error as you.

And when I upgraded the serverless environment to version 4, it worked as expected 😉
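If a notebook may run on either an older or a newer serverless environment, one option is to catch the assertion from the traceback and print a pointer to the fix instead of a bare AssertionError. This is only a sketch: construct_or_explain is a made-up helper, not a pyspark or Databricks API, and the hint text simply restates what this thread found.

```python
def construct_or_explain(factory):
    """Call factory() and return (object, None).

    If construction fails with the AssertionError shown in the traceback
    (no active SparkContext), return (None, hint) instead of raising.
    Hypothetical helper, not part of pyspark.
    """
    try:
        return factory(), None
    except AssertionError:
        hint = (
            "Constructing the estimator failed an assertion, likely "
            "'assert sc is not None'. On serverless compute, try selecting "
            "Environment Version 4."
        )
        return None, hint

# Intended use in a notebook (assumes pyspark is installed):
#   from pyspark.ml.feature import Bucketizer
#   bucketizer, hint = construct_or_explain(lambda: Bucketizer())
#   if hint:
#       print(hint)
```

Any other exception type still propagates, so unrelated failures are not hidden.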