Can't use pyspark bucketizer

wise_centipede · 10-01-2025

As the title suggests, I am struggling to use the PySpark Bucketizer. I repeatedly get the following error:

File <command-8301298062763331>, line 4
      2 from pyspark.ml.feature import Bucketizer
      3 spark = SparkSession.builder.appName("test").getOrCreate()
----> 4 bucketizer = Bucketizer()
File /databricks/python/lib/python3.12/site-packages/pyspark/ml/wrapper.py:87, in JavaWrapper._new_java_obj(java_class, *args)
     84 from pyspark.core.context import SparkContext
     86 sc = SparkContext._active_spark_context
---> 87 assert sc is not None
     89 java_obj = _jvm()
     90 for name in java_class.split("."):

Minimal reproducible example on serverless compute:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer
spark = SparkSession.builder.appName("test").getOrCreate()
bucketizer = Bucketizer()

nayan_wylde · 10-01-2025

Can you try passing the parameters explicitly when you construct the Bucketizer? Even though the docs list them as optional, it works for me when I provide splits, inputCol and outputCol:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer

# Initialize SparkSession with error handling
try:
    spark = SparkSession.builder.appName("BucketizerTest").getOrCreate()
    print(f"Spark version: {spark.version}")  # Verify SparkSession
except Exception as e:
    print(f"Failed to initialize SparkSession: {e}")
    raise

# Create a sample DataFrame
data = [(1, -0.5), (2, 0.0), (3, 1.5), (4, 3.0)]
df = spark.createDataFrame(data, ["id", "value"])

# Define splits for bucketizing
splits = [float("-inf"), 0.0, 1.0, 2.0, float("inf")]

# Initialize Bucketizer with required parameters
try:
    bucketizer = Bucketizer(
        splits=splits,
        inputCol="value",
        outputCol="bucket"
    )
    # Apply Bucketizer to DataFrame
    bucketed_df = bucketizer.transform(df)
    bucketed_df.show()
except Exception as e:
    print(f"Error with Bucketizer: {e}")
    raise

# Optional: Stop SparkSession (only if needed)
# spark.stop()
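
For reference, here is a plain-Python sketch of how I understand those splits map values to bucket indices. It needs no Spark at all, so you can sanity-check the expected output before running the real thing. The helper name bucket_index is made up for illustration and is not part of pyspark; it also simplifies things (Bucketizer returns the index as a double and has a handleInvalid option for NaN, which this sketch ignores).

```python
from bisect import bisect_right


def bucket_index(value, splits):
    """Hypothetical helper: index of the bucket that `value` falls into.

    Mirrors the documented Bucketizer rule: n+1 sorted splits define n
    buckets, each bucket is [lower, upper), and the last bucket also
    includes its upper bound.
    """
    if not (splits[0] <= value <= splits[-1]):
        # Also reached for NaN, since every comparison with NaN is False.
        raise ValueError(f"{value} is outside the range covered by splits")
    if value == splits[-1]:
        # The upper bound of the last split belongs to the last bucket.
        return len(splits) - 2
    # Number of splits <= value, minus one, is the bucket index.
    return bisect_right(splits, value) - 1


splits = [float("-inf"), 0.0, 1.0, 2.0, float("inf")]
values = [-0.5, 0.0, 1.5, 3.0]
buckets = [bucket_index(v, splits) for v in values]
print(buckets)  # [0, 1, 2, 3]
```

So for the sample DataFrame above I would expect the bucket column to come out as 0.0, 1.0, 2.0 and 3.0 for ids 1 to 4.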

szymon_dybczak · 10-01-2025

Hi @wise_centipede,

In your serverless compute settings, select Environment version 4 and it should work 🙂

With an environment version below 4, I got the same error as you.

After I upgraded the serverless environment to version 4, it worked as expected 😉
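
If you want to check which situation you are in before constructing the Bucketizer, something like the sketch below might help. It looks at the same thing the failing assert in wrapper.py does, namely SparkContext._active_spark_context. My assumption (not confirmed anywhere in this thread) is that serverless notebooks connect through Spark Connect, so there is no classic SparkContext on the Python side, and older environment versions still route pyspark.ml through the JVM wrapper. Note that _active_spark_context is a private attribute taken from the traceback, and has_active_spark_context is just a name I picked.

```python
def has_active_spark_context() -> bool:
    """Return True if a classic (JVM-backed) SparkContext is active.

    Reads the private attribute shown in the traceback, so treat this
    as a diagnostic aid only, not a supported API.
    """
    try:
        from pyspark import SparkContext
    except ImportError:
        # pyspark (or its classic core module) is not importable here.
        return False
    return getattr(SparkContext, "_active_spark_context", None) is not None


if has_active_spark_context():
    print("Active SparkContext found: the JVM-backed ML wrapper can be used.")
else:
    print("No active SparkContext: Bucketizer() may hit 'assert sc is not None' "
          "unless the environment supports Spark ML without it.")
```

If this prints the second message on an environment version below 4, that matches the error in the original post, and switching to version 4 is the fix that worked for me.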
