Databricks Community

Mr__E · 03-29-2022

I have followed the basic guide on using OneHotEncoder, matching the syntax exactly with my own data tables. The tables have enumerated string values. I first run a StringIndexer (both with and without handleInvalid set):

from pyspark.ml.feature import OneHotEncoder, StringIndexer

indexer = StringIndexer(
    inputCols=indexed_in_categorical_columns,
    outputCols=indexed_out_categorical_columns,
    handleInvalid='keep',
)
 
train_magic = train.select(indexed_in_categorical_columns).dropna()
indexed_stuff = indexer.fit(train_magic)
indexed_stuff_df = indexed_stuff.transform(train_magic)

Then I encode the indexed columns (I've tried individual columns -- some work and some don't -- as well as combined columns), with and without handleInvalid / dropLast set:

dumb_encoder = OneHotEncoder(
    handleInvalid='keep',
    dropLast=True,
    inputCols=indexer.getOutputCols(),
    outputCols=encoded_out_categorical_columns,
)

Then I run the encoder:

# fit() returns a OneHotEncoderModel (not a DataFrame); this is the call that raises
encoder_model = dumb_encoder.fit(indexed_stuff_df.select(indexed_out_categorical_columns))

The error from this step is:

IllegalArgumentException: requirement failed: Cannot have an empty string for name.

The message is useless, since it drops any information about the offending values. I've verified that the indexed columns have _no_ null values, and I tried (as above) running dropna(), so it doesn't make sense. I checked the param maps on the indexer and encoder and all of them have names, so that's not the issue.

Any thoughts on how I can figure this out?

Mr__E · 03-29-2022

My earlier search for empty strings in the original table failed to find them, but they were there. My guess is that, even though the encoder runs on the indexed columns, it validates against the original column values and ignores the 'handleInvalid' option, which leads to the error. It's incredibly confusing. Here is a workaround:

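from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Replace empty strings with a placeholder before fitting the StringIndexer
transform_empty = udf(lambda s: "NA" if s == "" else s, StringType())
for col in indexed_in_categorical_columns:
    train = train.withColumn(col, transform_empty(col))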
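
For anyone hitting the same error, here is a minimal sketch of the same idea without a Python UDF, plus a quick way to check which columns contain empty strings. Note that dropna() only removes nulls, so empty strings pass through it untouched. The helper names (replace_empty, count_empty_strings, fill_empty_strings) are made up for illustration and are not part of PySpark; the "NA" placeholder is taken from the workaround above. A plausible reading of the error, which I have not confirmed against the Spark source, is that the indexer stores each label in the column metadata and the encoder tries to use an empty label as an attribute name.

```python
def replace_empty(value, placeholder="NA"):
    """Plain-Python version of the rule in the UDF: "" becomes the placeholder.

    Any other value, including None, is returned unchanged.
    """
    return placeholder if value == "" else value


def count_empty_strings(df, columns):
    """Return a one-row DataFrame holding the number of "" values per column.

    Hypothetical helper; df is assumed to be a pyspark.sql.DataFrame.
    """
    from pyspark.sql import functions as F  # imported lazily; needs PySpark

    return df.select(
        # when() yields null where the condition is false, and count() skips nulls
        [F.count(F.when(F.col(name) == "", F.lit(1))).alias(name) for name in columns]
    )


def fill_empty_strings(df, columns, placeholder="NA"):
    """Replace "" with the placeholder using built-in column expressions.

    Hypothetical helper; nulls are left as nulls, matching replace_empty().
    """
    from pyspark.sql import functions as F

    for name in columns:
        df = df.withColumn(
            name,
            F.when(F.col(name) == "", F.lit(placeholder)).otherwise(F.col(name)),
        )
    return df
```

Assuming the variables from the posts above, count_empty_strings(train, indexed_in_categorical_columns).show() would list the affected columns, and train = fill_empty_strings(train, indexed_in_categorical_columns) would be run before fitting the StringIndexer. Built-in expressions are used here instead of a UDF only because they avoid sending each row through Python; the result should be the same as the workaround.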
