OneHotEncoder fails with 'Cannot have an empty string for name'

Mr__E
Contributor II

I have followed the basic guide on using OneHotEncoder, matching the syntax exactly with my own data tables. The tables have enumerated string values. I first run a StringIndexer (both with and without handleInvalid set):

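from pyspark.ml.feature import StringIndexer

# Index every categorical string column; invalid labels go into an extra bucket
indexer = StringIndexer(
    inputCols=indexed_in_categorical_columns,
    outputCols=indexed_out_categorical_columns,
    handleInvalid='keep',
)

# Keep only the categorical columns and drop rows containing nulls
train_magic = train.select(indexed_in_categorical_columns).dropna()
indexed_stuff = indexer.fit(train_magic)  # fitted StringIndexerModel
indexed_stuff_df = indexed_stuff.transform(train_magic)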

Then I encode the indexed columns. I've tried individual columns (some work and some don't) as well as all of the columns together, with and without handleInvalid and dropLast set:

from pyspark.ml.feature import OneHotEncoder

dumb_encoder = OneHotEncoder(
    handleInvalid='keep',
    dropLast=True,
    inputCols=indexer.getOutputCols(),
    outputCols=encoded_out_categorical_columns,
)

Then I run the encoder:

# fit() returns a fitted OneHotEncoderModel, not a DataFrame; the exception is raised here
encoded_stuff_df = dumb_encoder.fit(indexed_stuff_df.select(indexed_out_categorical_columns))

The error from this step is:

IllegalArgumentException: requirement failed: Cannot have an empty string for name.

The error message is useless, since it carries no information about the offending values. I've verified that the indexed columns have _no_ null values, and I also tried running dropna() (as above), so the error doesn't make sense. I checked the param maps on the indexer and encoder and every param has a name, so that's not the issue.

Any thoughts on how I can figure this out?
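
One way to narrow down the offending values is to inspect the labels that the fitted StringIndexer learned, rather than the nulls in the data. The sketch below is an assumption-laden illustration, not a confirmed diagnosis: it assumes the error comes from a category label that is empty (as the accepted solution suggests) or whitespace-only (as a later reply reports), and `find_blank_labels` is a hypothetical helper name. The sample labels are made up.

```python
def find_blank_labels(labels_by_column):
    """Return {column: [labels]} for labels that are empty or whitespace-only."""
    offenders = {}
    for column, labels in labels_by_column.items():
        blank = [label for label in labels if label.strip() == ""]
        if blank:
            offenders[column] = blank
    return offenders


# Made-up labels for illustration. With a fitted model on Spark 3.x, a mapping
# of this shape could presumably be built with something like:
#   dict(zip(indexed_in_categorical_columns, indexed_stuff.labelsArray))
# (assumption: the fitted StringIndexerModel exposes `labelsArray`).
sample_labels = {
    "color": ["red", "", "blue"],
    "size": ["S", "M"],
    "shape": ["  ", "round"],
}

print(find_blank_labels(sample_labels))
```

Columns that show up in the result are the ones worth cleaning before indexing; columns that do not appear would be expected to encode without this particular error.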

Accepted Solutions

Mr__E
Contributor II

My earlier search for empty strings in the original table failed to find them. So, I guess, what's going on is that even though the encoder runs on the indexed columns, it validates against the original string values and ignores the 'handleInvalid' option, which leads to the error. It's incredibly confusing. Here is a workaround that replaces empty strings with "NA" before indexing:

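from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Replace empty strings with the placeholder "NA"; other values (including None) pass through
transform_empty = udf(lambda s: "NA" if s == "" else s, StringType())
for col in indexed_in_categorical_columns:
    train = train.withColumn(col, transform_empty(col))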

2 REPLIES

EliasHaydar
New Contributor II

Nice catch! Indeed, the error is misleading. In my case, it was a specific column that had a string containing only whitespace.
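
Since both empty and whitespace-only strings have been reported in this thread, a cleanup step that covers both cases may be useful. The following is a minimal sketch under stated assumptions: `blank_to_placeholder` and `replace_blank_strings` are hypothetical helper names, the "NA" placeholder follows the accepted solution, and the Spark variant uses built-in column expressions instead of a Python UDF. It has not been verified against the data from the original question.

```python
def blank_to_placeholder(value, placeholder="NA"):
    """Map empty or whitespace-only strings to a placeholder; keep None and other values."""
    if value is not None and value.strip() == "":
        return placeholder
    return value


def replace_blank_strings(df, columns, placeholder="NA"):
    """Apply the same rule to a Spark DataFrame using built-in column expressions."""
    # Imported lazily so the pure-Python helper above works without PySpark installed.
    from pyspark.sql import functions as F

    for name in columns:
        # Matches empty strings and strings made up only of whitespace.
        # Nulls do not match, so they fall through to otherwise() unchanged.
        is_blank = F.col(name).rlike(r"^\s*$")
        df = df.withColumn(
            name, F.when(is_blank, F.lit(placeholder)).otherwise(F.col(name))
        )
    return df


# Pure-Python check of the rule on a few made-up values.
print([blank_to_placeholder(v) for v in ["red", "", "   ", None]])
```

With the names from the question, the Spark variant would be called as `train = replace_blank_strings(train, indexed_in_categorical_columns)` before fitting the StringIndexer. Using column expressions avoids the serialization overhead of a Python UDF, though the UDF from the accepted solution works as well.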
