09-02-2021 03:39 PM
I have set numBuckets and numBucketsArray to bin a group of columns into 5 buckets each.
Unfortunately, the requested number of buckets is not respected for all columns, even though each column contains variation.
I have also tried setting relativeError to 0.
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.QuantileDiscretizer...
Any idea why this happens, and how to force the specified number of buckets?
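For context, the shortfall usually comes from duplicate quantile boundaries: on skewed data, several of the requested split points coincide and are merged, leaving fewer bins than numBuckets. A pure-Python illustration of the effect with made-up data (a simplified nearest-rank percentile, not Spark's actual approxQuantile algorithm):

```python
# A skewed column: 70 zeros followed by 30 distinct positive values.
data = [0.0] * 70 + [float(i) for i in range(1, 31)]
data.sort()

# Quintile boundaries at the 20th/40th/60th/80th percentiles (nearest rank).
n = len(data)
raw_splits = [data[int(q * n)] for q in (0.2, 0.4, 0.6, 0.8)]
# Three of the four boundaries land on 0.0 because of the skew.

# Duplicate split points are merged, so only 2 interior boundaries
# survive and the discretizer can only produce 3 buckets, not 5.
unique_splits = sorted(set(raw_splits))
print(unique_splits)  # [0.0, 11.0]
```

Setting relativeError to 0 makes the quantile computation exact, but it cannot create distinct boundaries where the data has none.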
09-13-2021 06:19 PM
Thank you.
What I did was:
That fixed the issue! Defining your own splits would work as well, but in this case the split values themselves were important.
09-02-2021 07:53 PM
Hi @Sam! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers on the forum have an answer to your question first; otherwise I will follow up shortly with a response.
09-03-2021 06:11 AM
QuantileDiscretizer does not guarantee the number of buckets, AFAIK: depending on your data, you may get fewer buckets than requested, because duplicate quantile boundaries are merged.
Bucketizer, however, does guarantee it, but you have to define the splits yourself.
07-13-2022 08:13 PM
Can you explain a bit more?