Databricks Community

Sam · 09-02-2021

I have set numBuckets and numBucketsArray on QuantileDiscretizer for a group of columns to bin each of them into 5 buckets.

Unfortunately, the number of buckets is not respected across all columns, even though there is variation within each of them.

I have tried setting relativeError to 0.

Any idea why this happens, and how I can force the specified number of buckets?

-werners- · 09-03-2021

QuantileDiscretizer does not guarantee the number of buckets, as far as I know. Depending on your data, you might get fewer buckets than you asked for.

Bucketizer does, however, but you have to define the splits yourself.
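To illustrate why a quantile-based discretizer can come up short, here is a minimal plain-Python sketch. It is not Spark's implementation (Spark computes approximate quantiles); the helper names and the zero-heavy sample column are made up for the example. The idea is only that when many rows share one value, several quantile cut points land on that same value, and duplicate cut points cannot define separate buckets.

```python
from bisect import bisect_right


def distinct_quantile_cuts(values, num_buckets):
    """Interior cut points at the i/num_buckets quantiles (nearest rank), duplicates dropped."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, num_buckets):
        rank = max(1, (i * n + num_buckets - 1) // num_buckets)  # ceil(i * n / num_buckets)
        cuts.append(ordered[rank - 1])
    # Identical cut points would describe empty buckets, so keep only distinct ones.
    return sorted(set(cuts))


def bucket_index(value, cuts):
    """Bucket for a value, using half-open [low, high) intervals between cut points."""
    return bisect_right(cuts, value)


# Hypothetical zero-heavy column: 70 zeros followed by the values 1..30.
data = [0] * 70 + list(range(1, 31))

cuts = distinct_quantile_cuts(data, 5)
print(cuts)           # the 20%, 40% and 60% cut points are all 0 and collapse into one
print(len(cuts) + 1)  # buckets actually defined, fewer than the 5 requested

# With hand-picked cut points (hypothetical values) the bucket count is fixed up front.
manual_cuts = [1, 8, 15, 23]
manual_buckets = {bucket_index(v, manual_cuts) for v in data}
print(sorted(manual_buckets))
```

In Spark, hand-picked cut points like these would go to Bucketizer through its splits parameter, with negative and positive infinity as the outer bounds so that every value falls into some bucket.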

Sam · 09-13-2021

Thank you.

What I did was:

Applied QuantileDiscretizer to the non-zero values only, and specified a very small value (the bottom 1%) to capture the lowest bucket, including the zeros.

That fixed the issue! You can define your own splits, which would work as well, but the splits themselves were important in this case.
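One possible reading of this workaround, sketched in plain Python rather than Spark: take the quantile cut points from the non-zero values only, so they no longer pile up on zero, and add one very low cut point (here the 1st percentile of the non-zero values) so the zeros get a bucket of their own at the bottom. The sample data, the helper name and the exact percentile handling are assumptions for illustration, not the code actually used in this thread.

```python
import math
from bisect import bisect_right


def nearest_rank_quantile(ordered, p):
    """Nearest-rank quantile of an already sorted list, for 0 < p <= 1."""
    rank = max(1, math.ceil(p * len(ordered)))
    return ordered[rank - 1]


# Hypothetical zero-heavy column: 70 zeros followed by the values 1..30.
data = [0] * 70 + list(range(1, 31))
nonzero = sorted(v for v in data if v != 0)

# One very low cut point (assumed: the bottom 1% of the non-zero values).
low_cut = nearest_rank_quantile(nonzero, 0.01)

# The remaining cut points split the non-zero values into 4 quantile buckets.
upper_cuts = [nearest_rank_quantile(nonzero, i / 4) for i in range(1, 4)]

cuts = sorted(set([low_cut] + upper_cuts))
print(cuts)

# Half-open [low, high) intervals: everything below low_cut, zeros included,
# falls into bucket 0.
buckets = [bisect_right(cuts, v) for v in data]
print(sorted(set(buckets)))
```

On this sample the four cut points are distinct, so all five buckets exist and each one receives some rows. The quantile-derived cut points are kept rather than replaced by hand-picked ones, which matches the remark above that the splits themselves mattered.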

Hemant · 07-13-2022

Can you explain a bit more?

Hemant Soni

QuantileDiscretizer not respecting NumBuckets