<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Auto Loader Schema Hint Behavior: Addressing Nested Field Errors in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/auto-loader-schema-hint-behavior-addressing-nested-field-errors/m-p/66582#M33156</link>
    <description>&lt;P&gt;&lt;BR /&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for your help!&lt;BR /&gt;Your solution works for the initial issue,&lt;BR /&gt;and I've implemented it first in my code.&lt;BR /&gt;&lt;BR /&gt;but it creates a other problem.&lt;BR /&gt;When we explicitly define the struct hint as 'bbb STRUCT&amp;lt;ccc: INT&amp;gt;',&lt;BR /&gt;it works until someone adds more fields to 'bbb'.&lt;/P&gt;&lt;P&gt;For example, with this data:&lt;BR /&gt;```python&lt;BR /&gt;data_file = [{"aaa": "2020-09-22T00:00:00Z", "bbb": {"ccc": 1234, "eee": "blabla"}, "ddd": "blabla"}]&lt;BR /&gt;```&lt;BR /&gt;Using these Schema Hints:&lt;BR /&gt;```python&lt;BR /&gt;SCHEMA_HINTS = [&lt;BR /&gt;'aaa TIMESTAMP',&lt;BR /&gt;'bbb STRUCT&amp;lt;ccc: INT&amp;gt;',&lt;BR /&gt;'ddd STRING'&lt;BR /&gt;]&lt;BR /&gt;```&lt;BR /&gt;We get an error because it can't handle additional fields in 'bbb' that are not specified in the hint:&lt;BR /&gt;org.apache.spark.sql.catalyst.util.UnknownFieldException:&lt;BR /&gt;```python&lt;BR /&gt;[UNKNOWN_FIELD_EXCEPTION.NEW_FIELDS_IN_RECORD_WITH_FILE_PATH] Encountered unknown fields during parsing: {"bbb":{"eee":"blabla"}}, which can be fixed by an automatic retry: false&lt;BR /&gt;```&lt;BR /&gt;The original Schema Hints we started with:&lt;BR /&gt;```python&lt;BR /&gt;SCHEMA_HINTS = [&lt;BR /&gt;'aaa TIMESTAMP',&lt;BR /&gt;'bbb.ccc INT'&lt;BR /&gt;]&lt;BR /&gt;```&lt;BR /&gt;do not have this problem and will add 'eee' if it exists in the data.&lt;/P&gt;&lt;P&gt;Currently, to work around the issue,&lt;BR /&gt;we've implemented a temporary solution.&lt;BR /&gt;We generate an initial data file that includes all nested fields specified in the Schema Hints,&lt;BR /&gt;such as 'bbb', and always write it to our source directory.&lt;BR /&gt;This file is then discarded after schema creation.&lt;/P&gt;&lt;P&gt;However, I'd love to hear if there's a better solution that addresses the problem more elegantly.&lt;BR /&gt;Thank you very much!&lt;/P&gt;</description>
    <pubDate>Thu, 18 Apr 2024 10:18:25 GMT</pubDate>
    <dc:creator>my_super_name</dc:creator>
    <dc:date>2024-04-18T10:18:25Z</dc:date>
    <item>
      <title>Auto Loader Schema Hint Behavior: Addressing Nested Field Errors</title>
      <link>https://community.databricks.com/t5/data-engineering/auto-loader-schema-hint-behavior-addressing-nested-field-errors/m-p/66266#M33066</link>
      <description>&lt;P&gt;Hello,&lt;BR /&gt;I'm using the auto loader to stream a table of data and have added schema hints to specify field values.&lt;BR /&gt;I've observed that when my initial data file is missing fields specified in the schema hint,&lt;BR /&gt;the auto loader correctly identifies this and adds them to the schema.&lt;BR /&gt;&lt;BR /&gt;However, if these missing fields are nested within a struct, it throws an error stating "Couldn't find column example in:",&lt;BR /&gt;despite setting the attribute cloudFiles.inferColumnTypes = True.&lt;/P&gt;&lt;P&gt;For example, with the schema hints:&lt;BR /&gt;SCHEMA_HINTS = [&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 'aaa TIMESTAMP',&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; 'bbb.ccc INT']&lt;/P&gt;&lt;P&gt;If the first data file contains:&lt;BR /&gt;{&lt;BR /&gt;&amp;nbsp;&amp;nbsp; "aaa": "2020-09-22T00:00:00Z",&lt;BR /&gt;&amp;nbsp;&amp;nbsp; "bbb": {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "ccc": 1234&lt;BR /&gt;},&lt;BR /&gt;&amp;nbsp;&amp;nbsp; "ddd": "blabla"&lt;BR /&gt;}&lt;/P&gt;&lt;P&gt;Then ddd is added to the schema seamlessly.&lt;BR /&gt;&lt;BR /&gt;However, if the first data file is missing fields within the struct, like so:&lt;BR /&gt;{&lt;BR /&gt;&amp;nbsp;&amp;nbsp; "aaa": "2020-09-22T00:00:00Z",&lt;BR /&gt;&amp;nbsp;&amp;nbsp; "ddd": "blabla"&lt;BR /&gt;}&lt;/P&gt;&lt;P&gt;Then an error occurs:&lt;BR /&gt;Couldn't find column bbb in:&lt;BR /&gt;root&lt;BR /&gt;|-- aaa: timestamp (nullable = true)&lt;BR /&gt;|-- ddd: string (nullable = true)&lt;/P&gt;&lt;P&gt;Why doesn't the auto loader add these fields to the schema in this case?&lt;BR /&gt;Is there a solution to ensure it does?&lt;BR /&gt;&lt;BR /&gt;Thank you!&lt;/P&gt;</description>
      <pubDate>Mon, 15 Apr 2024 12:37:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/auto-loader-schema-hint-behavior-addressing-nested-field-errors/m-p/66266#M33066</guid>
      <dc:creator>my_super_name</dc:creator>
      <dc:date>2024-04-15T12:37:51Z</dc:date>
    </item>
    <item>
      <title>Re: Auto Loader Schema Hint Behavior: Addressing Nested Field Errors</title>
      <link>https://community.databricks.com/t5/data-engineering/auto-loader-schema-hint-behavior-addressing-nested-field-errors/m-p/66582#M33156</link>
      <description>&lt;P&gt;&lt;BR /&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for your help!&lt;BR /&gt;Your solution works for the initial issue,&lt;BR /&gt;and I've implemented it first in my code.&lt;BR /&gt;&lt;BR /&gt;but it creates a other problem.&lt;BR /&gt;When we explicitly define the struct hint as 'bbb STRUCT&amp;lt;ccc: INT&amp;gt;',&lt;BR /&gt;it works until someone adds more fields to 'bbb'.&lt;/P&gt;&lt;P&gt;For example, with this data:&lt;BR /&gt;```python&lt;BR /&gt;data_file = [{"aaa": "2020-09-22T00:00:00Z", "bbb": {"ccc": 1234, "eee": "blabla"}, "ddd": "blabla"}]&lt;BR /&gt;```&lt;BR /&gt;Using these Schema Hints:&lt;BR /&gt;```python&lt;BR /&gt;SCHEMA_HINTS = [&lt;BR /&gt;'aaa TIMESTAMP',&lt;BR /&gt;'bbb STRUCT&amp;lt;ccc: INT&amp;gt;',&lt;BR /&gt;'ddd STRING'&lt;BR /&gt;]&lt;BR /&gt;```&lt;BR /&gt;We get an error because it can't handle additional fields in 'bbb' that are not specified in the hint:&lt;BR /&gt;org.apache.spark.sql.catalyst.util.UnknownFieldException:&lt;BR /&gt;```python&lt;BR /&gt;[UNKNOWN_FIELD_EXCEPTION.NEW_FIELDS_IN_RECORD_WITH_FILE_PATH] Encountered unknown fields during parsing: {"bbb":{"eee":"blabla"}}, which can be fixed by an automatic retry: false&lt;BR /&gt;```&lt;BR /&gt;The original Schema Hints we started with:&lt;BR /&gt;```python&lt;BR /&gt;SCHEMA_HINTS = [&lt;BR /&gt;'aaa TIMESTAMP',&lt;BR /&gt;'bbb.ccc INT'&lt;BR /&gt;]&lt;BR /&gt;```&lt;BR /&gt;do not have this problem and will add 'eee' if it exists in the data.&lt;/P&gt;&lt;P&gt;Currently, to work around the issue,&lt;BR /&gt;we've implemented a temporary solution.&lt;BR /&gt;We generate an initial data file that includes all nested fields specified in the Schema Hints,&lt;BR /&gt;such as 'bbb', and always write it to our source directory.&lt;BR /&gt;This file is then discarded after schema creation.&lt;/P&gt;&lt;P&gt;However, I'd love to hear if there's a better solution that addresses the problem more elegantly.&lt;BR /&gt;Thank you very much!&lt;/P&gt;</description>
      <pubDate>Thu, 18 Apr 2024 10:18:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/auto-loader-schema-hint-behavior-addressing-nested-field-errors/m-p/66582#M33156</guid>
      <dc:creator>my_super_name</dc:creator>
      <dc:date>2024-04-18T10:18:25Z</dc:date>
    </item>
    <item>
      <title>Re: Auto Loader Schema Hint Behavior: Addressing Nested Field Errors</title>
      <link>https://community.databricks.com/t5/data-engineering/auto-loader-schema-hint-behavior-addressing-nested-field-errors/m-p/104292#M41703</link>
      <description>&lt;P&gt;Hi,&amp;nbsp;&lt;BR /&gt;we are having similar issues with schema hints formulated in&amp;nbsp;&lt;SPAN&gt;fully qualified DDL, e.g. "a STRUCT&amp;lt;b INT&amp;gt;" etc. Did you find a solution?&amp;nbsp;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;BR /&gt;Also, did you specify the schema hint using the dot-notation, e.g. "a.b INT" &lt;STRONG&gt;before&lt;/STRONG&gt;&amp;nbsp;ingesting any data or after (i.e. as a type override)?&amp;nbsp;&lt;BR /&gt;&lt;BR /&gt;I keep getting errors such as this:&amp;nbsp;&lt;BR /&gt;The following is not a valid schema hint: &lt;/SPAN&gt;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;data.id STRING, metadata.timestamp STRING,...&lt;/LI-CODE&gt;&lt;P&gt;&lt;SPAN&gt;Use either a column descriptor syntax that overrides a leaf column (e.g. `a.element.b LONG`), or a fully qualified DDL that merges with the inferred schema completely (e.g. `STRUCT&amp;lt;a ARRAY&amp;lt;STRUCT&amp;lt;b LONG&amp;gt;&amp;gt;&amp;gt;` SQLSTATE: 42601&lt;BR /&gt;&lt;BR /&gt;Thank you!&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 06 Jan 2025 09:38:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/auto-loader-schema-hint-behavior-addressing-nested-field-errors/m-p/104292#M41703</guid>
      <dc:creator>Mathias_Peters</dc:creator>
      <dc:date>2025-01-06T09:38:34Z</dc:date>
    </item>
    <item>
      <title>Re: Auto Loader Schema Hint Behavior: Addressing Nested Field Errors</title>
      <link>https://community.databricks.com/t5/data-engineering/auto-loader-schema-hint-behavior-addressing-nested-field-errors/m-p/127111#M47859</link>
      <description>&lt;P&gt;Hi, I encountered the same problem. Have you found a solution?&lt;/P&gt;</description>
      <pubDate>Thu, 31 Jul 2025 15:24:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/auto-loader-schema-hint-behavior-addressing-nested-field-errors/m-p/127111#M47859</guid>
      <dc:creator>777</dc:creator>
      <dc:date>2025-07-31T15:24:06Z</dc:date>
    </item>
  </channel>
</rss>

