10-01-2025 10:25 AM
As the title suggests, I am struggling to use the PySpark Bucketizer, as I repeatedly get the following error:
File <command-8301298062763331>, line 4
2 from pyspark.ml.feature import Bucketizer
3 spark = SparkSession.builder.appName("test").getOrCreate()
----> 4 bucketizer = Bucketizer()
File /databricks/python/lib/python3.12/site-packages/pyspark/ml/wrapper.py:87, in JavaWrapper._new_java_obj(java_class, *args)
84 from pyspark.core.context import SparkContext
86 sc = SparkContext._active_spark_context
---> 87 assert sc is not None
89 java_obj = _jvm()
90 for name in java_class.split("."):
from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer
spark = SparkSession.builder.appName("test").getOrCreate()
bucketizer = Bucketizer()
10-01-2025 10:48 AM
Can you try providing the parameters explicitly when you construct the Bucketizer? Even though the docs list them as optional, I see it works when splits, inputCol and outputCol are provided.
from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer

# Initialize SparkSession with error handling
try:
    spark = SparkSession.builder.appName("BucketizerTest").getOrCreate()
    print(f"Spark version: {spark.version}")  # Verify SparkSession
except Exception as e:
    print(f"Failed to initialize SparkSession: {e}")
    raise

# Create a sample DataFrame
data = [(1, -0.5), (2, 0.0), (3, 1.5), (4, 3.0)]
df = spark.createDataFrame(data, ["id", "value"])

# Define splits for bucketizing
splits = [float("-inf"), 0.0, 1.0, 2.0, float("inf")]

# Initialize Bucketizer with explicit parameters
try:
    bucketizer = Bucketizer(
        splits=splits,
        inputCol="value",
        outputCol="bucket"
    )
    # Apply Bucketizer to DataFrame
    bucketed_df = bucketizer.transform(df)
    bucketed_df.show()
except Exception as e:
    print(f"Error with Bucketizer: {e}")
    raise

# Optional: Stop SparkSession (only if needed)
# spark.stop()
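If it helps to sanity-check the result once the snippet above runs, here is a rough plain-Python sketch of how I understand Bucketizer to assign buckets: each bucket is the half-open range [splits[i], splits[i+1]), except the last one, which also includes its upper bound. The helper name bucket_index is made up for illustration and is not part of PySpark, and Spark writes the bucket column as doubles (0.0, 1.0, ...) rather than ints.

```python
from bisect import bisect_right


def bucket_index(value, splits):
    """Hypothetical helper: index of the bucket [splits[i], splits[i+1]) holding value.

    Mirrors the documented Bucketizer rule; the last bucket also includes
    its upper bound. Not a PySpark function.
    """
    # NaN (value != value) and out-of-range values are rejected here,
    # similar in spirit to Bucketizer's default handleInvalid="error".
    if value != value or value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside the range covered by splits")
    if value == splits[-1]:
        return len(splits) - 2
    return bisect_right(splits, value) - 1


splits = [float("-inf"), 0.0, 1.0, 2.0, float("inf")]
values = [-0.5, 0.0, 1.5, 3.0]
print([bucket_index(v, splits) for v in values])  # [0, 1, 2, 3]
```

So for the sample DataFrame the bucket column should read 0.0, 1.0, 2.0, 3.0 for ids 1 through 4.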
10-01-2025 11:07 AM - edited 10-01-2025 11:10 AM
Hi @wise_centipede ,
In your Serverless compute, select Environment Version: 4 and it will work 🙂
With a version below 4 I got the same error as you.
And when I upgraded the serverless environment to version 4, it worked as expected 😉