Mr__E
Contributor II

My earlier search for empty strings in the original table failed. So my guess is that, even though the encoder runs on the indexed columns, it still validates against the original columns and ignores the 'handleInvalid' option, which leads to the error. It's incredibly confusing. Here is a workaround:

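from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
# Replace empty strings with "NA"; nulls and all other values pass through unchanged.
transform_empty = udf(lambda s: "NA" if s == "" else s, StringType())
for c in indexed_in_categorical_columns:  # "c" avoids shadowing pyspark.sql.functions.col
    train = train.withColumn(c, transform_empty(c))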
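
In case it helps anyone, the same replacement can probably be done with Spark's built-in column functions instead of a Python UDF, which usually avoids the per-row Python overhead. This is only a sketch and I have not run it against the original table. The names `PLACEHOLDER`, `empty_to_placeholder` and `fill_empty_strings` are made up for illustration; `train` and `indexed_in_categorical_columns` are the ones from the snippet above.

```python
from typing import Optional

PLACEHOLDER = "NA"  # same placeholder value as in the workaround above


def empty_to_placeholder(s: Optional[str]) -> Optional[str]:
    """Plain-Python version of the rule: "" becomes the placeholder,
    None and every other string pass through unchanged."""
    return PLACEHOLDER if s == "" else s


def fill_empty_strings(df, columns):
    """Apply the same rule to each named column using when/otherwise.

    `df` is assumed to be a Spark DataFrame and `columns` a list of
    string column names (hypothetical helper, not part of PySpark).
    """
    # Imported here so the helper above can be tried without Spark installed.
    from pyspark.sql import functions as F

    for name in columns:
        df = df.withColumn(
            name,
            # A null value makes the comparison null, so it falls through
            # to otherwise() and stays null, matching empty_to_placeholder.
            F.when(F.col(name) == "", F.lit(PLACEHOLDER)).otherwise(F.col(name)),
        )
    return df
```

Usage would then be `train = fill_empty_strings(train, indexed_in_categorical_columns)`.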
