QuantileDiscretizer not respecting NumBuckets

Sam
New Contributor III

I have set numBuckets and numBucketsArray for a group of columns to bin them into 5 buckets.

Unfortunately the number of buckets does not seem to be respected across all columns even though there is variation within them.

I have tried setting relativeError to 0.

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.QuantileDiscretizer...

Any idea why this is and how to solve it to force the number of buckets specified?

4 REPLIES

Kaniz
Community Manager

Hi @Sam! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers on the Forum have an answer to your question first. Otherwise, I will follow up shortly with a response.

-werners-
Esteemed Contributor III

QuantileDiscretizer does not guarantee the number of buckets, afaik. numBuckets is an upper bound: depending on your data (for example, many repeated values), you might get fewer buckets than you asked for.

Bucketizer however does, but you have to define your splits.
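To illustrate why a column can end up with fewer buckets, here is a minimal sketch in plain Python. It is not Spark's implementation (Spark computes approximate quantiles); it only shows the mechanism, assuming that identical quantile cut points are merged into one. The `column` data and the helper name `distinct_quantile_splits` are made up for the example.

```python
from statistics import quantiles


def distinct_quantile_splits(values, num_buckets):
    """Return the distinct interior cut points for `num_buckets` quantile bins."""
    # quantiles() returns num_buckets - 1 cut points; duplicates are possible
    # when one value (here: zero) dominates the column.
    cuts = quantiles(values, n=num_buckets, method="inclusive")
    return sorted(set(cuts))


# Hypothetical zero-heavy column: 70 zeros and 30 distinct positive values.
column = [0.0] * 70 + [float(v) for v in range(1, 31)]

splits = distinct_quantile_splits(column, 5)
# Number of buckets that survive once identical cut points are merged.
effective_buckets = len(splits) + 1
print(splits, effective_buckets)
```

Because most of the requested cut points land on zero, they collapse into one, and the column gets fewer than the five buckets requested even though the positive values vary.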

Sam
New Contributor III

Thank you.

What I did was:

  1. Applied QuantileDiscretizer to the non-zero values only, and specified a very small value (the bottom 1%) as the lowest split so that the lowest bucket also captures the zeroes.

That fixed the issue! You can define your own splits, which would work as well, but the splits themselves were important in this case.
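One possible reading of this workaround, sketched in plain Python rather than Spark: compute the cut points from the non-zero values only, and use their 1st percentile as the lowest split so that the zeroes land in the bottom bucket. The helper names (`zero_aware_splits`, `assign_bucket`), the sample data, and the exact choice of percentiles are assumptions for illustration, not the poster's actual code.

```python
from bisect import bisect_right
from statistics import quantiles


def zero_aware_splits(values, num_buckets=5):
    """Build interior splits from the non-zero values only.

    Assumption: the lowest split is the 1st percentile of the non-zero
    values, so bucket 0 holds the zeroes plus the bottom 1% of non-zeros.
    """
    non_zero = sorted(v for v in values if v != 0)
    low = quantiles(non_zero, n=100, method="inclusive")[0]  # 1st percentile
    # The remaining buckets split the non-zero values into equal-frequency bins.
    upper = quantiles(non_zero, n=num_buckets - 1, method="inclusive")
    return sorted({low, *upper})


def assign_bucket(value, splits):
    """Bucket index for `value`; a value equal to a split goes to the upper bucket."""
    return bisect_right(splits, value)


# Hypothetical zero-heavy column: 70 zeros and 30 distinct positive values.
column = [0.0] * 70 + [float(v) for v in range(1, 31)]

splits = zero_aware_splits(column, 5)
buckets = [assign_bucket(v, splits) for v in column]
print(splits, sorted(set(buckets)))
```

In Spark, a list of splits like this (padded with negative and positive infinity at the ends) could be passed to Bucketizer through its `splits` parameter, which keeps the bucket count fixed as long as the splits are strictly increasing.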

Hemant
Valued Contributor II

Can you explain a bit more?

Hemant Soni