Data Engineering

QuantileDiscretizer not respecting NumBuckets

Sam
New Contributor III

I have set numBuckets and numBucketsArray for a group of columns to bin them into 5 buckets.

Unfortunately the number of buckets does not seem to be respected across all columns even though there is variation within them.

I have tried setting relativeError to 0.

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.QuantileDiscretizer...

Any idea why this is and how to solve it to force the number of buckets specified?

4 REPLIES

Kaniz
Community Manager

Hi @Sam! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers on the Forum have an answer to your question first. Otherwise, I will follow up shortly with a response.

-werners-
Esteemed Contributor III

QuantileDiscretizer does not guarantee the number of buckets, afaik. Depending on your data, you might get fewer buckets than you asked for.

Bucketizer, however, does, but you have to define the splits yourself.
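To illustrate why a quantile-based discretizer can return fewer buckets than requested, here is a minimal plain-Python sketch. It is not Spark's implementation (Spark computes approximate quantiles on the DataFrame); the helper `quantile_splits` and the sample columns are hypothetical. The point is only that when many rows share the same value, several quantile cut points coincide, and collapsing the duplicates leaves fewer bins.

```python
import statistics

def quantile_splits(values, num_buckets):
    """Return the distinct interior cut points for `num_buckets` quantile bins.

    Hypothetical helper for illustration; not part of any Spark API.
    """
    cuts = statistics.quantiles(values, n=num_buckets, method="inclusive")
    # Identical cut points would describe empty bins, so they are collapsed.
    return sorted(set(cuts))

# Assumed sample column: 80% zeros, so most quantiles land on the same value.
skewed = [0] * 80 + list(range(1, 21))
skewed_splits = quantile_splits(skewed, 5)
skewed_bins = len(skewed_splits) + 1   # fewer than the 5 bins requested

# Assumed sample column with evenly spread values: all cut points are distinct.
even = list(range(100))
even_splits = quantile_splits(even, 5)
even_bins = len(even_splits) + 1       # the full 5 bins

print(skewed_bins, even_bins)
```

With explicit splits there is no such collapse, because the number of intervals is fixed by the split list you supply rather than derived from the data.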

Sam
New Contributor III
Accepted Solution

Thank you.

What I did was:

  1. Applied QuantileDiscretizer to the non-zero values only, and specified a very small value (the bottom 1%) as the lowest split so that the bottom bucket captures the zeros as well.

That fixed the issue! Defining your own splits would work as well, but in this case the quantile-derived splits themselves were important.
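One possible reading of this workaround, sketched in plain Python rather than Spark: compute the quantile cut points on the non-zero values only, then add one very low cut (here the 1st percentile of the non-zero values) so that the zeros fall into a dedicated bottom bucket. The function `zero_aware_splits`, the `low_fraction` parameter, and the sample data are assumptions for illustration, not the poster's actual code.

```python
import bisect
import statistics

def zero_aware_splits(values, num_buckets, low_fraction=0.01):
    """Interior cut points: one low cut plus quantiles of the non-zero values.

    Hypothetical helper; assumes num_buckets >= 3 and enough distinct
    non-zero values for the quantile cuts to differ.
    """
    nonzero = sorted(v for v in values if v != 0)
    # Low cut: the `low_fraction` quantile of the non-zero values.
    n_low = round(1 / low_fraction)
    low_cut = statistics.quantiles(nonzero, n=n_low, method="inclusive")[0]
    # Remaining cuts divide the non-zero values into num_buckets - 1 groups.
    upper = statistics.quantiles(nonzero, n=num_buckets - 1, method="inclusive")
    return sorted(set([low_cut] + upper))

# Assumed sample column: 80% zeros plus a spread of non-zero values.
data = [0] * 80 + list(range(1, 21))
cuts = zero_aware_splits(data, 5)

# Assign each value to a bucket index by locating it among the cut points.
buckets = [bisect.bisect_right(cuts, v) for v in data]
print(len(cuts) + 1, sorted(set(buckets)))
```

In Spark, cut points obtained this way could be padded with negative and positive infinity and passed to Bucketizer through its `splits` parameter, which is one way to keep quantile-derived boundaries while fixing the bucket count.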

Hemant
Valued Contributor II

Can you explain a bit more?

Hemant Soni