<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: how to avoid extra column after retry upon UnknownFieldException in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-avoid-extra-column-after-retry-upon-unknownfieldexception/m-p/140898#M51568</link>
    <description>&lt;P&gt;But that extra column is exactly one unknown field as far as the schema is concerned (a single, really long name). To me it looks like malformed JSON or something similar, where many fields end up in one column, but it's hard to guess without seeing a sample of the data. Personally, I prefer to save JSON as the VARIANT type and extract the fields later (if it is JSON).&lt;/P&gt;</description>
    <pubDate>Tue, 02 Dec 2025 16:42:38 GMT</pubDate>
    <dc:creator>Hubert-Dudek</dc:creator>
    <dc:date>2025-12-02T16:42:38Z</dc:date>
    <item>
      <title>how to avoid extra column after retry upon UnknownFieldException</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-avoid-extra-column-after-retry-upon-unknownfieldexception/m-p/140897#M51567</link>
      <description>&lt;P&gt;With Auto Loader and&lt;/P&gt;&lt;LI-CODE lang="python"&gt;.option("cloudFiles.schemaEvolutionMode", "addNewColumns")&lt;/LI-CODE&gt;&lt;P&gt;I retried the stream after getting this error:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;org.apache.spark.sql.catalyst.util.UnknownFieldException: [UNKNOWN_FIELD_EXCEPTION.NEW_FIELDS_IN_FILE]
 Encountered unknown fields during parsing:
 [test 1_2 Prime, test 1_2 Redundant, test 1_4 Prime, test 1_4 Redundant], which can be fixed by an automatic retry: true
&lt;/LI-CODE&gt;&lt;P&gt;The data is successfully written to the target Delta table and the new columns are added. However, the target Delta table also has an extra column:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;timestamptest_1_1_primetest_1_1_redundanttest_1_2_primetest_1_2_redundanttest_1_3_primetest_1_3_redundanttest_1_4_primetest_1_4_redundant:string&lt;/LI-CODE&gt;&lt;P&gt;Why is the extra column added, and how can I avoid it?&lt;/P&gt;&lt;P&gt;Note that before calling df.writeStream, the code uses df.toDF() to rename the columns.&lt;BR /&gt;In summary, the code does: readStream, rename columns, writeStream.&lt;/P&gt;</description>
      <pubDate>Tue, 02 Dec 2025 16:30:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-avoid-extra-column-after-retry-upon-unknownfieldexception/m-p/140897#M51567</guid>
      <dc:creator>cdn_yyz_yul</dc:creator>
      <dc:date>2025-12-02T16:30:40Z</dc:date>
    </item>
    <item>
      <title>Re: how to avoid extra column after retry upon UnknownFieldException</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-avoid-extra-column-after-retry-upon-unknownfieldexception/m-p/140898#M51568</link>
      <description>&lt;P&gt;But that extra column is exactly one unknown field as far as the schema is concerned (a single, really long name). To me it looks like malformed JSON or something similar, where many fields end up in one column, but it's hard to guess without seeing a sample of the data. Personally, I prefer to save JSON as the VARIANT type and extract the fields later (if it is JSON).&lt;/P&gt;</description>
      <pubDate>Tue, 02 Dec 2025 16:42:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-avoid-extra-column-after-retry-upon-unknownfieldexception/m-p/140898#M51568</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2025-12-02T16:42:38Z</dc:date>
    </item>
    <item>
      <title>Re: how to avoid extra column after retry upon UnknownFieldException</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-avoid-extra-column-after-retry-upon-unknownfieldexception/m-p/140905#M51570</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/25346"&gt;@Hubert-Dudek&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;The input is CSV.&lt;/P&gt;&lt;P&gt;readStream reads the CSV with .option("cloudFiles.inferColumnTypes", "true").&lt;/P&gt;&lt;P&gt;Then df.toDF() is called to rename the columns. The original CSV header contains spaces, which is why the error message shows "test 1_2 Prime"; the rename changes it to test_1_2_prime.&lt;/P&gt;&lt;P&gt;Finally, the df with the renamed columns is written to the Delta sink.&lt;/P&gt;&lt;P&gt;---&lt;/P&gt;&lt;P&gt;I just noticed that the inferred column type is double. The Databricks docs say:&lt;/P&gt;&lt;P&gt;For formats that don't encode data types (JSON, CSV, and XML), Auto Loader infers all columns as strings (including nested fields in JSON files)&lt;/P&gt;</description>
      <pubDate>Tue, 02 Dec 2025 17:09:16 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-avoid-extra-column-after-retry-upon-unknownfieldexception/m-p/140905#M51570</guid>
      <dc:creator>cdn_yyz_yul</dc:creator>
      <dc:date>2025-12-02T17:09:16Z</dc:date>
    </item>
    <item>
      <title>Re: how to avoid extra column after retry upon UnknownFieldException</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-avoid-extra-column-after-retry-upon-unknownfieldexception/m-p/141013#M51603</link>
      <description>&lt;P&gt;Even though the input is CSV, it does indeed contain some malformed rows.&lt;/P&gt;</description>
      <pubDate>Wed, 03 Dec 2025 12:39:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-avoid-extra-column-after-retry-upon-unknownfieldexception/m-p/141013#M51603</guid>
      <dc:creator>cdn_yyz_yul</dc:creator>
      <dc:date>2025-12-03T12:39:31Z</dc:date>
    </item>
  </channel>
</rss>

