Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

how to avoid extra column after retry upon UnknownFieldException

cdn_yyz_yul
New Contributor III

 

With Auto Loader and

.option("cloudFiles.schemaEvolutionMode", "addNewColumns")

 

I have done a retry after getting:

org.apache.spark.sql.catalyst.util.UnknownFieldException: [UNKNOWN_FIELD_EXCEPTION.NEW_FIELDS_IN_FILE] 
 Encountered unknown fields during parsing: 
 [test 1_2 Prime, test 1_2 Redundant, test 1_4 Prime, test 1_4 Redundant], which can be fixed by an automatic retry: true
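For reference, the retry here is essentially a restart of the stream once Auto Loader has recorded the new columns in its schema location. A minimal sketch of such a retry loop, where start_stream() is an assumed helper (not the original code) that builds and starts the query:

# Hypothetical retry loop; start_stream() is an assumed helper that builds the
# readStream/writeStream pipeline and returns the running StreamingQuery.
def run_with_retry(max_retries=2):
    for attempt in range(max_retries + 1):
        try:
            query = start_stream()
            query.awaitTermination()
            return
        except Exception as e:
            # After UnknownFieldException the schema location already contains the
            # new columns, so a plain restart picks them up.
            if "UNKNOWN_FIELD_EXCEPTION" in str(e) and attempt < max_retries:
                continue
            raise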


The data is successfully written to the target Delta table and the new columns are added. However, the target Delta table has an extra column:

timestamptest_1_1_primetest_1_1_redundanttest_1_2_primetest_1_2_redundanttest_1_3_primetest_1_3_redundanttest_1_4_primetest_1_4_redundant:string

 

Why is the extra column added? How can I avoid it?

Note that before calling df.writeStream(), the code uses df.toDF() to rename the columns.
In summary, the code does: readStream, rename columns, writeStream.
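A minimal sketch of the write side, assuming renamed_df is the DataFrame produced by the toDF() rename (the checkpoint location, target table name, and the mergeSchema option are illustrative, not the original code):

# Hypothetical write sketch; checkpoint path and table name are placeholders.
query = (
    renamed_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/path/to/_checkpoint")   # placeholder
    .option("mergeSchema", "true")                           # let newly added columns into the target table
    .toTable("target_delta_table")                           # placeholder
)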

2 REPLIES

Hubert-Dudek
Esteemed Contributor III

But that extra column is exactly an unknown field from the schema (one really long name). To me, it looks like malformed JSON or something similar (so a lot of fields end up in one column), but without seeing a sample of the data it's hard to guess. Personally, I prefer to save JSON as a VARIANT type and extract the fields later (if it is JSON).
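As a sketch of that VARIANT approach (it only applies if the source were JSON, it requires a runtime with VARIANT support, and the paths are placeholders):

# Hypothetical VARIANT sketch: keep each JSON record as a raw string, then parse it
# into a single VARIANT column and extract fields later.
from pyspark.sql.functions import expr

raw = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "text")      # one JSON record per line, kept as a string
    .load("/path/to/json_source")             # placeholder
)
variant_df = raw.select(expr("parse_json(value)").alias("payload"))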

cdn_yyz_yul
New Contributor III

Hi @Hubert-Dudek 

The input is CSV.

readStream reads the CSV with

.option("cloudFiles.inferColumnTypes", "true"). 

Then, df.toDF() is called to rename the columns. The original CSV header has spaces, which is why the error message shows "test 1_2 Prime". The rename changes it to test_1_2_prime.
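For illustration, a rename along those lines might look like this sketch (the exact mapping in the original code is not shown):

# Hypothetical rename sketch: lowercase the headers and replace spaces with underscores,
# e.g. "test 1_2 Prime" -> "test_1_2_prime".
renamed_df = df.toDF(*[c.lower().replace(" ", "_") for c in df.columns])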
 
Finally, the df with the renamed columns is written to the Delta sink.

---

I just noticed that the inferred column type is double. The Databricks doc says:
"For formats that don't encode data types (JSON, CSV, and XML), Auto Loader infers all columns as strings (including nested fields in JSON files)."