topic Re: how to avoid extra column after retry upon UnknownFieldException in Data Engineering

how to avoid extra column after retry upon UnknownFieldException

cdn_yyz_yul — Tue, 02 Dec 2025 16:30:40 GMT

With autoloader

.option("cloudFiles.schemaEvolutionMode", "addNewColumns")

I have done retry after getting org.apache.spark.sql.catalyst.util.UnknownFieldException: [UNKNOWN_FIELD_EXCEPTION.NEW_FIELDS_IN_FILE] Encountered unknown fields during parsing: [test 1_2 Prime, test 1_2 Redundant, test 1_4 Prime, test 1_4 Redundant], which can be fixed by an automatic retry: true The data is successfully written to the target delta table, new columns are added. However, the target delta table has an extra column: timestamptest_1_1_primetest_1_1_redundanttest_1_2_primetest_1_2_redundanttest_1_3_primetest_1_3_redundanttest_1_4_primetest_1_4_redundant:string

Why the extra column is added? How to avoid it.

Note that before calling df.writeStream(), the code has used df.toDF() to rename the columns.
In summary, the code has: readStream, rename column, writeStream.

Re: how to avoid extra column after retry upon UnknownFieldException

Hubert-Dudek — Tue, 02 Dec 2025 16:42:38 GMT

But that extra column is exactly an unknown field from the schema (one really long name). For me, it's like incorrect JSON or smth (so a lot of fields end up in one column), but without seeing a sample of data, it's hard to guess. Personally I prefer to save json as VARIANT type and extract later (if it is json)

Re: how to avoid extra column after retry upon UnknownFieldException

cdn_yyz_yul — Tue, 02 Dec 2025 17:09:16 GMT

Hi @Hubert-Dudek

the input is csv.

readStream reads csv with

.option("cloudFiles.inferColumnTypes", "true").

then, df.toDF( ) is called to rename the column name. The original csv header has space, that's why error message has "test 1_2 Prime". The rename changed it to test_1_2_prime.

Finally, the df with renamed columns are written to delta sink.

---

I just noticed that the inferred column type is double. Databricks doc says:

For formats that don't encode data types (JSON, CSV, and XML), Auto Loader infers all columns as strings (including nested fields in JSON files)

Re: how to avoid extra column after retry upon UnknownFieldException

cdn_yyz_yul — Wed, 03 Dec 2025 12:39:31 GMT

Even though the input is csv, it has indeed some rows mis-formatted.