topic Re: Auto Loader Schema Hint Behavior: Addressing Nested Field Errors in Data Engineering

Auto Loader Schema Hint Behavior: Addressing Nested Field Errors

my_super_name — Mon, 15 Apr 2024 12:37:51 GMT

Hello,
I'm using the auto loader to stream a table of data and have added schema hints to specify field values.
I've observed that when my initial data file is missing fields specified in the schema hint,
the auto loader correctly identifies this and adds them to the schema.

However, if these missing fields are nested within a struct, it throws an error stating "Couldn't find column example in:",
despite setting the attribute cloudFiles.inferColumnTypes = True.

For example, with the schema hints:
SCHEMA_HINTS = [
'aaa TIMESTAMP',
'bbb.ccc INT']

If the first data file contains:
{
   "aaa": "2020-09-22T00:00:00Z",
   "bbb": {
      "ccc": 1234
},
   "ddd": "blabla"
}

Then ddd is added to the schema seamlessly.

However, if the first data file is missing fields within the struct, like so:
{
"aaa": "2020-09-22T00:00:00Z",
"ddd": "blabla"
}

Then an error occurs:
Couldn't find column bbb in:
root
|-- aaa: timestamp (nullable = true)
|-- ddd: string (nullable = true)

Why doesn't the auto loader add these fields to the schema in this case?
Is there a solution to ensure it does?

Thank you!

Re: Auto Loader Schema Hint Behavior: Addressing Nested Field Errors

my_super_name — Thu, 18 Apr 2024 10:18:25 GMT

Hi @Retired_mod

Thanks for your help!
Your solution works for the initial issue,
and I've implemented it first in my code.

but it creates a other problem.
When we explicitly define the struct hint as 'bbb STRUCT<ccc: INT>',
it works until someone adds more fields to 'bbb'.

For example, with this data:
```python
data_file = [{"aaa": "2020-09-22T00:00:00Z", "bbb": {"ccc": 1234, "eee": "blabla"}, "ddd": "blabla"}]
```
Using these Schema Hints:
```python
SCHEMA_HINTS = [
'aaa TIMESTAMP',
'bbb STRUCT<ccc: INT>',
'ddd STRING'
]
```
We get an error because it can't handle additional fields in 'bbb' that are not specified in the hint:
org.apache.spark.sql.catalyst.util.UnknownFieldException:
```python
[UNKNOWN_FIELD_EXCEPTION.NEW_FIELDS_IN_RECORD_WITH_FILE_PATH] Encountered unknown fields during parsing: {"bbb":{"eee":"blabla"}}, which can be fixed by an automatic retry: false
```
The original Schema Hints we started with:
```python
SCHEMA_HINTS = [
'aaa TIMESTAMP',
'bbb.ccc INT'
]
```
do not have this problem and will add 'eee' if it exists in the data.

Currently, to work around the issue,
we've implemented a temporary solution.
We generate an initial data file that includes all nested fields specified in the Schema Hints,
such as 'bbb', and always write it to our source directory.
This file is then discarded after schema creation.

However, I'd love to hear if there's a better solution that addresses the problem more elegantly.
Thank you very much!

Re: Auto Loader Schema Hint Behavior: Addressing Nested Field Errors

Mathias_Peters — Mon, 06 Jan 2025 09:38:34 GMT

Hi,
we are having similar issues with schema hints formulated in fully qualified DDL, e.g. "a STRUCT<b INT>" etc. Did you find a solution?

Also, did you specify the schema hint using the dot-notation, e.g. "a.b INT" before ingesting any data or after (i.e. as a type override)?

I keep getting errors such as this:
The following is not a valid schema hint:

data.id STRING, metadata.timestamp STRING,...

Use either a column descriptor syntax that overrides a leaf column (e.g. `a.element.b LONG`), or a fully qualified DDL that merges with the inferred schema completely (e.g. `STRUCT<a ARRAY<STRUCT<b LONG>>>` SQLSTATE: 42601

Thank you!

Re: Auto Loader Schema Hint Behavior: Addressing Nested Field Errors

777 — Thu, 31 Jul 2025 15:24:06 GMT

Hi, I encountered the same problem. Have you found a solution?