OneHotEncoder fails with 'Cannot have an empty string for name'

Mr__E
Contributor II

I have followed the basic guide on using OneHotEncoder, matching the syntax exactly with my own data tables. The tables have enumerated string values. I first run a StringIndexer (both with and without handleInvalid set):

from pyspark.ml.feature import OneHotEncoder, StringIndexer

indexer = StringIndexer(
    inputCols=indexed_in_categorical_columns,
    outputCols=indexed_out_categorical_columns,
    handleInvalid='keep',
)
 
train_magic = train.select(indexed_in_categorical_columns).dropna()
indexed_stuff = indexer.fit(train_magic)
indexed_stuff_df = indexed_stuff.transform(train_magic)

Then I encode the indexed columns, with and without handleInvalid / dropLast set. I've tried individual columns (some work and some don't) as well as all columns combined:

dumb_encoder = OneHotEncoder(
    handleInvalid='keep',
    dropLast=True,
    inputCols=indexer.getOutputCols(),
    outputCols=encoded_out_categorical_columns,
)

Then I run the encoder:

encoded_stuff_df = dumb_encoder.fit(indexed_stuff_df.select(indexed_out_categorical_columns))

The error from this step is:

IllegalArgumentException: requirement failed: Cannot have an empty string for name.

The error message is unhelpful, since it gives no information about the offending values. I've verified that the indexed columns have _no_ null values, and I also tried running dropna() (as above), so the error doesn't make sense. I checked the param maps on the indexer and encoder, and all of the params have names, so that's not the issue.

Any thoughts on how I can figure this out?
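
One way to narrow this down, assuming the error comes from a blank category label rather than from a param name, is to inspect the labels that the fitted StringIndexer model learned. The sketch below is not from the original post: find_blank_labels is a hypothetical helper, and the commented usage line assumes a Spark 3.x release in which StringIndexerModel exposes labelsArray.

```python
def find_blank_labels(input_cols, labels_array):
    """Map each column name to its labels that are empty or whitespace-only."""
    blank = {}
    for name, labels in zip(input_cols, labels_array):
        bad = [label for label in labels if label.strip() == ""]
        if bad:
            blank[name] = bad
    return blank


# Usage with the fitted model from the question (assumes labelsArray is available):
# print(find_blank_labels(indexed_stuff.getInputCols(), indexed_stuff.labelsArray))

# Stand-alone demonstration with made-up labels (hypothetical column names):
example = find_blank_labels(["color", "size"], [["red", "", "blue"], ["S", "M"]])
print(example)
```

Any column reported here is a candidate for cleaning before the indexer is fitted.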

1 ACCEPTED SOLUTION

Accepted Solutions

Mr__E
Contributor II

My earlier search for empty strings in the original table had failed to find them. So my guess is that, despite running the encoder on the indexed columns, the encoder validates against the original column values and ignores the 'handleInvalid' option, which leads to the error. It's incredibly confusing. Here is a workaround:

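from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Replace empty strings with a placeholder before running the indexer
transform_empty = udf(lambda s: "NA" if s == "" else s, StringType())
for c in indexed_in_categorical_columns:
    train = train.withColumn(c, transform_empty(c))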

2 REPLIES

EliasHaydar
New Contributor II

Nice catch! Indeed, the error is misleading. In my case, it was a specific column that contained a string consisting only of whitespace.
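
To cover the whitespace-only case as well, the workaround from the accepted solution can be extended to strip each value before comparing it. This is a minimal sketch, not the original author's code: blank_to_placeholder and clean_columns are hypothetical names, and "NA" is the same placeholder used in the accepted solution. The pyspark imports are kept inside clean_columns so the pure-Python check can be tried without a Spark session.

```python
def blank_to_placeholder(value, placeholder="NA"):
    """Return the placeholder for empty or whitespace-only strings; pass others through."""
    if value is not None and value.strip() == "":
        return placeholder
    return value


def clean_columns(df, columns, placeholder="NA"):
    """Apply blank_to_placeholder to each listed string column of a Spark DataFrame."""
    # Imported here so the helper above stays usable without pyspark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    fix = udf(lambda s: blank_to_placeholder(s, placeholder), StringType())
    for name in columns:
        df = df.withColumn(name, fix(df[name]))
    return df


# Usage with the names from the question, run before fitting the StringIndexer:
# train = clean_columns(train, indexed_in_categorical_columns)
```

Null values are passed through unchanged, so dropna() or the indexer's handleInvalid setting still decides what happens to them.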