09-02-2021 03:39 PM
I have set numBuckets and numBucketsArray for a group of columns to bin them into 5 buckets.
Unfortunately, the number of buckets does not seem to be respected for all columns, even though there is variation within them.
I have also tried setting relativeError to 0.
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.QuantileDiscretizer...
Any idea why this happens and how to force the specified number of buckets?
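Roughly what my setup looks like (the column and DataFrame names below are just placeholders for illustration):

```python
from pyspark.ml.feature import QuantileDiscretizer

# Placeholder column names; the real job bins a larger group of columns.
in_cols = ["col_a", "col_b", "col_c"]
out_cols = [c + "_bin" for c in in_cols]

discretizer = QuantileDiscretizer(
    numBucketsArray=[5] * len(in_cols),  # ask for 5 buckets per column
    inputCols=in_cols,
    outputCols=out_cols,
    relativeError=0.0,                   # exact quantiles, no approximation
)

binned = discretizer.fit(df).transform(df)  # df is the input DataFrame
```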
09-13-2021 06:19 PM
Thank you.
What I did was:
That fixed the issue! You can define your own splits, which would work as well, but the splits themselves were important in this case.
09-03-2021 06:11 AM
QuantileDiscretizer does not guarantee the number of buckets, as far as I know. Depending on your data, you might get fewer buckets than requested.
Bucketizer, however, does guarantee it, but you have to define the splits yourself.
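For example, a minimal sketch with made-up split points and column names (adjust both to your data):

```python
from pyspark.ml.feature import Bucketizer

# Made-up split boundaries; choose values that match your data.
# n+1 boundaries define exactly n buckets, so 6 boundaries -> 5 buckets.
splits = [float("-inf"), 0.0, 10.0, 20.0, 30.0, float("inf")]

bucketizer = Bucketizer(
    splitsArray=[splits, splits, splits],  # one list of splits per input column
    inputCols=["col_a", "col_b", "col_c"],
    outputCols=["col_a_bin", "col_b_bin", "col_c_bin"],
    handleInvalid="keep",                  # NaNs go into their own extra bucket
)

binned = bucketizer.transform(df)  # no fit step; the splits are fixed up front
```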
07-13-2022 08:13 PM
Can you explain a bit more?