03-29-2022 06:44 AM
I have followed the basic guide on using OneHotEncoder, matching the syntax exactly with my own data tables. The tables have enumerated string values. I first run a StringIndexer (both with and without handleInvalid set):
from pyspark.ml.feature import OneHotEncoder, StringIndexer
indexer = StringIndexer(
    inputCols=indexed_in_categorical_columns,
    outputCols=indexed_out_categorical_columns,
    handleInvalid='keep',
)
train_magic = train.select(indexed_in_categorical_columns).dropna()
indexed_stuff = indexer.fit(train_magic)
indexed_stuff_df = indexed_stuff.transform(train_magic)
Then I encode the indexed columns with OneHotEncoder, with and without handleInvalid / dropLast set. I've tried individual columns (some work and some don't) as well as all of them combined:
dumb_encoder = OneHotEncoder(
    handleInvalid='keep',
    dropLast=True,
    inputCols=indexer.getOutputCols(),
    outputCols=encoded_out_categorical_columns,
)
Then I run the encoder:
encoded_stuff_df = dumb_encoder.fit(indexed_stuff_df.select(indexed_out_categorical_columns))
The error from this step is:
IllegalArgumentException: requirement failed: Cannot have an empty string for name.
The error message is useless, since it drops any information about the offending values. I've verified that the indexed columns have _no_ null values, and I tried (as above) running dropna(), so it doesn't make sense. I checked the param maps on the indexer and encoder and all of the params have names, so that's not the issue.
Any thoughts on how I can figure this out?
Accepted Solutions
03-29-2022 07:16 AM
My earlier search for empty strings in the original table failed. So my guess is that, even though the encoder runs on the indexed columns, it validates against the values of the original columns and ignores the 'handleInvalid' option, which leads to the error. It's incredibly confusing. Here is a workaround:
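from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Replace empty strings with a placeholder; other values (including None) pass through.
transform_empty = udf(lambda s: "NA" if s == "" else s, StringType())
for column in indexed_in_categorical_columns:
    train = train.withColumn(column, transform_empty(column))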
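For what it's worth, the same replacement can probably be done without a Python UDF by passing Spark SQL expressions to selectExpr. The following is only a sketch under assumptions, not something tested against the data in this thread: the helper name blank_to_placeholder_exprs is made up for illustration, and it assumes column names contain no backticks and the placeholder contains no quotes. It also treats strings made up only of spaces as blank, since a later reply in this thread reports that case.

```python
def blank_to_placeholder_exprs(all_columns, target_columns, placeholder="NA"):
    """Build Spark SQL select expressions that swap blank strings for a placeholder.

    Columns not listed in target_columns are passed through unchanged.
    Hypothetical helper: assumes no backticks in column names and no
    quotes in the placeholder.
    """
    targets = set(target_columns)
    exprs = []
    for name in all_columns:
        if name in targets:
            # trim() strips leading/trailing spaces, so '' and '   ' both
            # compare equal to ''. NULLs fail the comparison and fall
            # through to the ELSE branch unchanged.
            exprs.append(
                f"CASE WHEN trim(`{name}`) = '' THEN '{placeholder}' "
                f"ELSE `{name}` END AS `{name}`"
            )
        else:
            exprs.append(f"`{name}`")
    return exprs


# Usage sketch, with `train` and `indexed_in_categorical_columns` as in the post:
# train = train.selectExpr(
#     *blank_to_placeholder_exprs(train.columns, indexed_in_categorical_columns)
# )
```

Note that trim only handles space characters here; tabs or other whitespace would likely need a regex-based condition instead.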
06-16-2022 05:44 AM
Nice catch! Indeed, the error is misleading. In my case, it was a specific column that contained a string consisting only of whitespace.
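To find which column is the culprit, one option is to count blank values per column before fitting anything. This is a rough sketch rather than a verified recipe: blank_count_exprs is a hypothetical helper name, it assumes column names contain no backticks, and it only catches empty strings and strings made of spaces (not tabs or other whitespace).

```python
def blank_count_exprs(columns):
    """Build one Spark SQL aggregate per column counting blank strings.

    A value counts as blank when it is empty or consists only of spaces.
    NULLs are not counted, because the comparison evaluates to NULL and
    the CASE falls through to ELSE 0.
    """
    return [
        f"sum(CASE WHEN trim(`{name}`) = '' THEN 1 ELSE 0 END) AS `{name}`"
        for name in columns
    ]


# Usage sketch, with `train` and `indexed_in_categorical_columns` as in the
# original post; any column with a non-zero count is a candidate offender:
# train.selectExpr(*blank_count_exprs(indexed_in_categorical_columns)).show()
```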

