
Can't use pyspark bucketizer

wise_centipede
New Contributor II

As the title suggests, I'm struggling to use the PySpark Bucketizer: I repeatedly get the following error:

File <command-8301298062763331>, line 4
      2 from pyspark.ml.feature import Bucketizer
      3 spark = SparkSession.builder.appName("test").getOrCreate()
----> 4 bucketizer = Bucketizer()
File /databricks/python/lib/python3.12/site-packages/pyspark/ml/wrapper.py:87, in JavaWrapper._new_java_obj(java_class, *args)
     84 from pyspark.core.context import SparkContext
     86 sc = SparkContext._active_spark_context
---> 87 assert sc is not None
     89 java_obj = _jvm()
     90 for name in java_class.split("."):
Minimal reproducible example on serverless compute:
from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer
spark = SparkSession.builder.appName("test").getOrCreate()
bucketizer = Bucketizer()
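
For what it's worth, the assertion that fails here checks SparkContext._active_spark_context, and a minimal probe (a sketch, assuming serverless compute runs over Spark Connect, where no classic JVM-backed SparkContext exists on the client) confirms the condition:

from pyspark.sql import SparkSession
from pyspark import SparkContext

spark = SparkSession.builder.appName("test").getOrCreate()

# On a Spark Connect client (e.g. serverless), there is no classic
# JVM-backed context, so this prints None -- the same condition the
# Bucketizer constructor asserts on.
print(SparkContext._active_spark_context)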

 

1 ACCEPTED SOLUTION

szymon_dybczak
Esteemed Contributor III

 

Hi @wise_centipede ,

In your Serverless compute, select Environment version 4 and it will work 🙂


With an environment version below 4 I got the same error as you, and after upgrading the serverless environment to version 4 it works as expected 😉

 

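If you're not sure where to change it: on serverless compute the environment version is picked in the notebook's Environment side panel. Once it's set to 4, rerunning the repro is a quick sanity check (a minimal sketch, just the code from the question plus a print):

from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer

spark = SparkSession.builder.appName("test").getOrCreate()

# On environment version 4 this should construct cleanly instead of
# failing the SparkContext assertion.
bucketizer = Bucketizer()
print(bucketizer)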

2 REPLIES

nayan_wylde
Honored Contributor III

Can you try providing the mandatory parameters to the Bucketizer? Even though the docs mark them as optional, I see it works when you provide splits, inputCol, and outputCol:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer

# Initialize SparkSession with error handling
try:
    spark = SparkSession.builder.appName("BucketizerTest").getOrCreate()
    print(f"Spark version: {spark.version}")  # Verify SparkSession
except Exception as e:
    print(f"Failed to initialize SparkSession: {e}")
    raise

# Create a sample DataFrame
data = [(1, -0.5), (2, 0.0), (3, 1.5), (4, 3.0)]
df = spark.createDataFrame(data, ["id", "value"])

# Define splits for bucketizing
splits = [float("-inf"), 0.0, 1.0, 2.0, float("inf")]

# Initialize Bucketizer with required parameters
try:
    bucketizer = Bucketizer(
        splits=splits,
        inputCol="value",
        outputCol="bucket"
    )
    # Apply Bucketizer to DataFrame
    bucketed_df = bucketizer.transform(df)
    bucketed_df.show()
except Exception as e:
    print(f"Error with Bucketizer: {e}")
    raise

# Optional: Stop SparkSession (only if needed)
# spark.stop()
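
For reference, Bucketizer maps each value to the half-open bin [split_i, split_i+1) defined by the splits (the last bin also includes its upper bound), so with the splits above the show() call should print something like:

+---+-----+------+
| id|value|bucket|
+---+-----+------+
|  1| -0.5|   0.0|
|  2|  0.0|   1.0|
|  3|  1.5|   2.0|
|  4|  3.0|   3.0|
+---+-----+------+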

