OneHotEncoder fails with 'Cannot have an empty string for name'

Mr__E
Contributor II

I followed the basic guide on using OneHotEncoder, matching its syntax exactly but with my own data tables. The categorical columns hold enumerated string values. I first run a StringIndexer (I have tried both with and without handleInvalid set):

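from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Map each categorical string column to a numeric index column.
indexer = StringIndexer(
    inputCols=indexed_in_categorical_columns,
    outputCols=indexed_out_categorical_columns,
    handleInvalid='keep',
)

# Fit and transform on the categorical columns only, with rows containing nulls dropped.
train_magic = train.select(indexed_in_categorical_columns).dropna()
indexed_stuff = indexer.fit(train_magic)
indexed_stuff_df = indexed_stuff.transform(train_magic)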

Then I encode the indexed columns, with and without handleInvalid / dropLast set. I've tried individual columns (some work and some don't) as well as all of the columns combined:

dumb_encoder = OneHotEncoder(
    handleInvalid='keep',
    dropLast=True,
    inputCols=indexer.getOutputCols(),
    outputCols=encoded_out_categorical_columns,
)

Then I run the encoder:

encoded_stuff_df = dumb_encoder.fit(indexed_stuff_df.select(indexed_out_categorical_columns))

The error from this step is:

IllegalArgumentException: requirement failed: Cannot have an empty string for name.

The error message is unhelpful, since it gives no information about the offending values. I've verified that the indexed columns have _no_ null values, and (as above) I tried running dropna(), so the error doesn't make sense to me. I also checked the param maps on the indexer and encoder, and every param has a name, so that's not the issue.

Any thoughts on how I can figure this out?

Accepted Solution

Mr__E
Contributor II

My earlier search for empty strings in the original table had missed them. My guess is that, even though the encoder runs on the indexed columns, it validates against the original column values and ignores the 'handleInvalid' option, which leads to the error. It's incredibly confusing. Here is a workaround:

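from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Replace empty strings with a placeholder category before running the indexer.
transform_empty = udf(lambda s: "NA" if s == "" else s, StringType())
for col in indexed_in_categorical_columns:
    train = train.withColumn(col, transform_empty(col))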
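
A variant of the workaround above, sketched under a few assumptions: it also treats whitespace-only strings as blank, and it keeps the replacement rule in a plain Python function so the rule can be checked without a Spark session. The names `blank_to_placeholder` and `replace_blank_strings` are hypothetical helpers, not part of PySpark, and the "NA" placeholder is only safe if "NA" is not already a real category in the data.

```python
from typing import Optional


def blank_to_placeholder(value: Optional[str], placeholder: str = "NA") -> Optional[str]:
    """Return `placeholder` for empty or whitespace-only strings; pass everything else through."""
    if value is not None and value.strip() == "":
        return placeholder
    return value


def replace_blank_strings(df, columns, placeholder="NA"):
    """Apply blank_to_placeholder to each named string column of a Spark DataFrame."""
    # Imported here so the pure-Python helper above can be used without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    fix_blank = udf(lambda s: blank_to_placeholder(s, placeholder), StringType())
    for name in columns:
        df = df.withColumn(name, fix_blank(df[name]))
    return df
```

With the names from the question, this would be called as `train = replace_blank_strings(train, indexed_in_categorical_columns)` before fitting the StringIndexer. Nulls are passed through unchanged, so the existing dropna() step still applies.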


2 REPLIES 2

EliasHaydar
New Contributor II

Nice catch! Indeed, the error is misleading. In my case, it was a specific column that contained a string consisting only of whitespace.
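
If, as both replies suggest, a blank category value is the trigger, one way to find the affected columns is to look at the labels the fitted StringIndexer learned. The sketch below is a plain Python helper (the name `find_blank_labels` is hypothetical) that takes the input column names and a list of label lists. It assumes a Spark version in which the fitted StringIndexerModel exposes `labelsArray`; that attribute is not mentioned in the thread, so treat it as an assumption to verify.

```python
def find_blank_labels(input_cols, labels_array):
    """Map each input column to the empty or whitespace-only labels learned for it."""
    blank = {}
    for name, labels in zip(input_cols, labels_array):
        hits = [label for label in labels if label.strip() == ""]
        if hits:
            blank[name] = hits
    return blank


# Hypothetical usage with the names from the question:
# find_blank_labels(indexer.getInputCols(), indexed_stuff.labelsArray)
```

Any column that shows up in the result is a candidate for the placeholder replacement described in the accepted solution.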
