I followed the basic guide for OneHotEncoder, using the same syntax on my own data tables. The categorical columns hold enumerated string values. I first run a StringIndexer (I've tried it both with and without handleInvalid set):
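from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Map each string category to a numeric index; 'keep' puts unseen values in an extra bucket.
indexer = StringIndexer(
    inputCols=indexed_in_categorical_columns,
    outputCols=indexed_out_categorical_columns,
    handleInvalid='keep',
)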
train_magic = train.select(indexed_in_categorical_columns).dropna()
indexed_stuff = indexer.fit(train_magic)
indexed_stuff_df = indexed_stuff.transform(train_magic)
Then I encode the indexed columns, with and without handleInvalid / dropLast set. I've tried individual columns (some work and some don't) as well as all of the columns together:
dumb_encoder = OneHotEncoder(
    handleInvalid='keep',
    dropLast=True,
    inputCols=indexer.getOutputCols(),
    outputCols=encoded_out_categorical_columns,
)
Then I fit the encoder:
encoded_stuff_df = dumb_encoder.fit(indexed_stuff_df.select(indexed_out_categorical_columns))
The error from this step is:
IllegalArgumentException: requirement failed: Cannot have an empty string for name.
The message isn't helpful, since it says nothing about which value or column is the problem. I've verified that the indexed columns have _no_ null values, and I also tried running dropna() (as above), so nulls don't seem to be the cause. I also checked the param maps on the indexer and the encoder, and every param has a name, so that doesn't appear to be the issue either.
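One thing I have not ruled out: dropna() removes nulls but not empty strings, so a column could still contain '' as a category, and that label might be what ends up as the empty "name". Below is a minimal sketch of how I could check for that, written in plain Python so it runs without a cluster. The helper find_empty_labels and the sample data are hypothetical names I made up for illustration; the link between empty labels and this error is an assumption, not something I have confirmed.

```python
def find_empty_labels(labels_by_column):
    """Return {column: count of empty-string labels} for columns that have any."""
    offenders = {}
    for column, labels in labels_by_column.items():
        # Only the exact empty string is counted; whitespace-only labels are not.
        empty_count = sum(1 for label in labels if label == "")
        if empty_count:
            offenders[column] = empty_count
    return offenders


# Hypothetical stand-in for the labels learned by the fitted indexer.
# With the real model this might be built as
#   dict(zip(indexed_in_categorical_columns, indexed_stuff.labelsArray))
# assuming the installed Spark version exposes labelsArray on the fitted model.
sample_labels = {
    "color": ["red", "blue", ""],
    "size": ["S", "M", "L"],
}

offenders = find_empty_labels(sample_labels)
print(offenders)
```

If a column shows up here, it would at least narrow the problem down to specific columns, which would fit with some columns working on their own and others not.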
Any thoughts on how I can figure this out?