cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Auto Loader Schema Hint Behavior: Addressing Nested Field Errors

my_super_name
New Contributor II

Hello,
I'm using the auto loader to stream a table of data and have added schema hints to specify field values.
I've observed that when my initial data file is missing fields specified in the schema hint,
the auto loader correctly identifies this and adds them to the schema.

However, if these missing fields are nested within a struct, it throws an error stating "Couldn't find column example in:",
despite setting the attribute cloudFiles.inferColumnTypes = True.

For example, with the schema hints:
SCHEMA_HINTS = [
     'aaa TIMESTAMP',
     'bbb.ccc INT']

If the first data file contains:
{
   "aaa": "2020-09-22T00:00:00Z",
   "bbb": {
      "ccc": 1234
},
   "ddd": "blabla"
}

Then ddd is added to the schema seamlessly.

However, if the first data file is missing fields within the struct, like so:
{
   "aaa": "2020-09-22T00:00:00Z",
   "ddd": "blabla"
}

Then an error occurs:
Couldn't find column bbb in:
root
|-- aaa: timestamp (nullable = true)
|-- ddd: string (nullable = true)

Why doesn't the auto loader add these fields to the schema in this case?
Is there a solution to ensure it does?

Thank you!

2 REPLIES 2

Kaniz
Community Manager
Community Manager

Hi @my_super_name

  1. Default Schema Inference: By default, Auto Loader schema inference aims to avoid schema evolution issues due to type mismatches. For formats like JSON, CSV, and XML that don’t encode data types explicitly, Auto Loader infers all columns as strings. This includes nested fields in JSON files1.

  2. Nested Fields and Missing Columns: When your initial data file is missing fields specified in the schema hint, Auto Loader correctly identifies this and adds them to the schema. However, if these missing fields are nested within a struct, the behavior changes. Let’s consider your example:

    • Schema Hints:

      SCHEMA_HINTS = [
          'aaa TIMESTAMP',
          'bbb.ccc INT'
      ]
      
    • First Data File (with all fields):

      {
          "aaa": "2020-09-22T00:00:00Z",
          "bbb": {
              "ccc": 1234
          },
          "ddd": "blabla"
      }
      

    In this case, ddd is added to the schema seamlessly.

    • First Data File (missing fields within the struct):
      {
          "aaa": "2020-09-22T00:00:00Z",
          "ddd": "blabla"
      }
      

    The error occurs because Auto Loader couldn’t find the column bbb in the schema. It only infers columns at the top level, not within nested structures.

  3. Solution: To ensure that missing fields within a struct are added to the schema, you can follow these steps:

    • Explicitly define the nested fields in your schema hints. For example:
      SCHEMA_HINTS = [
          'aaa TIMESTAMP',
          'bbb STRUCT<ccc: INT>',
          'ddd STRING'
      ]
      

    By specifying the nested field bbb as a STRUCT<ccc: INT>, Auto Loader will correctly infer the nested structure and add missing fields to the schema.

    Remember to adjust your schema hints to match the actual structure of your data files.

In summary, explicitly defining nested fields in your schema hints will ensure that Auto Loader corr....

Happy data streaming! 🚀

 

my_super_name
New Contributor II


Hi @Kaniz 

Thanks for your help!
Your solution works for the initial issue,
and I've implemented it first in my code.

but it creates a other problem.
When we explicitly define the struct hint as 'bbb STRUCT<ccc: INT>',
it works until someone adds more fields to 'bbb'.

For example, with this data:
```python
data_file = [{"aaa": "2020-09-22T00:00:00Z", "bbb": {"ccc": 1234, "eee": "blabla"}, "ddd": "blabla"}]
```
Using these Schema Hints:
```python
SCHEMA_HINTS = [
'aaa TIMESTAMP',
'bbb STRUCT<ccc: INT>',
'ddd STRING'
]
```
We get an error because it can't handle additional fields in 'bbb' that are not specified in the hint:
org.apache.spark.sql.catalyst.util.UnknownFieldException:
```python
[UNKNOWN_FIELD_EXCEPTION.NEW_FIELDS_IN_RECORD_WITH_FILE_PATH] Encountered unknown fields during parsing: {"bbb":{"eee":"blabla"}}, which can be fixed by an automatic retry: false
```
The original Schema Hints we started with:
```python
SCHEMA_HINTS = [
'aaa TIMESTAMP',
'bbb.ccc INT'
]
```
do not have this problem and will add 'eee' if it exists in the data.

Currently, to work around the issue,
we've implemented a temporary solution.
We generate an initial data file that includes all nested fields specified in the Schema Hints,
such as 'bbb', and always write it to our source directory.
This file is then discarded after schema creation.

However, I'd love to hear if there's a better solution that addresses the problem more elegantly.
Thank you very much!