12-02-2025 08:30 AM
With Auto Loader
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")
I retried after getting:
org.apache.spark.sql.catalyst.util.UnknownFieldException: [UNKNOWN_FIELD_EXCEPTION.NEW_FIELDS_IN_FILE]
Encountered unknown fields during parsing:
[test 1_2 Prime, test 1_2 Redundant, test 1_4 Prime, test 1_4 Redundant], which can be fixed by an automatic retry: true
The data is successfully written to the target Delta table, and the new columns are added. However, the target Delta table also contains an extra column:
timestamptest_1_1_primetest_1_1_redundanttest_1_2_primetest_1_2_redundanttest_1_3_primetest_1_3_redundanttest_1_4_primetest_1_4_redundant:string
Why is the extra column added? How can I avoid it?
Note that before calling df.writeStream(), the code uses df.toDF() to rename the columns.
In summary, the code does: readStream, rename columns, writeStream.
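For illustration, the rename step described above can be sketched as a plain-Python helper. This is a guess at the mapping, assuming the long column names from the error (e.g. "test 1_2 Prime") are being normalized into the snake_case names seen in the merged column name; the `sanitize` helper is hypothetical, not the poster's actual code.

```python
# Hypothetical sketch of the column-rename step done via df.toDF(*renamed).
# Assumption: spaces are replaced with underscores and names are lowercased,
# which matches the snake_case fragments visible in the extra column's name.
def sanitize(name: str) -> str:
    return name.strip().lower().replace(" ", "_")

# Column names taken from the UnknownFieldException message above.
source_cols = ["timestamp", "test 1_2 Prime", "test 1_2 Redundant"]
renamed = [sanitize(c) for c in source_cols]
# In Spark this would then be applied as: df = df.toDF(*renamed)
```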
12-02-2025 08:42 AM
But that extra column is exactly an unknown field from the schema (one really long name). To me it looks like malformed JSON or something similar (so many fields end up in a single column), but without seeing a sample of the data it's hard to guess. Personally, I prefer to save JSON as a VARIANT type and extract the fields later (if it is JSON).
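The "many fields end up in one column" hypothesis is consistent with the extra column's name, which is all of the real headers mashed together. A minimal, self-contained demonstration of how that can happen, using hypothetical sample data and a deliberately wrong delimiter (this is an illustration of the failure mode, not the poster's actual input):

```python
import csv
import io

# Hypothetical sample: a semicolon-delimited file parsed with the default
# comma delimiter. Because no commas appear, every line collapses into a
# single field, so the "column name" becomes all headers concatenated.
raw = "timestamp;test 1_2 Prime;test 1_2 Redundant\n2025-12-02T08:30:00;1.0;2.0\n"
rows = list(csv.reader(io.StringIO(raw), delimiter=","))
header = rows[0]
# header is a single element containing the entire original header line
```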
12-03-2025 04:39 AM
Even though the input is CSV, it does indeed contain some mis-formatted rows.
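One way to confirm which rows are mis-formatted before streaming the files is to compare each row's field count against the header. A minimal sketch, using the standard-library csv module and hypothetical sample data (the delimiter and file contents are assumptions):

```python
import csv
import io

def find_bad_rows(text: str, delimiter: str = ",") -> list[int]:
    """Return 1-based line numbers of rows whose field count differs from the header."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    expected = len(rows[0])  # header defines the expected column count
    return [i + 1 for i, row in enumerate(rows[1:], start=1) if len(row) != expected]

# Hypothetical sample: line 3 has too few fields, line 4 has too many.
sample = "a,b,c\n1,2,3\n4,5\n6,7,8,9\n"
bad = find_bad_rows(sample)
```

Rows flagged this way end up in Auto Loader's rescued-data handling or, as seen here, trigger schema-evolution retries, so cleaning or quarantining them upstream avoids the surprise columns.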
12-02-2025 09:09 AM
The input is CSV.
readStream reads csv with