Databricks Community

my_super_name · 04-15-2024

Hello,
I'm using Auto Loader to stream a table of data and have added schema hints to specify field types.
I've observed that when my initial data file is missing fields specified in the schema hints,
Auto Loader correctly identifies this and adds them to the schema.

However, if the missing fields are nested within a struct, it throws an error stating "Couldn't find column example in:",
even though cloudFiles.inferColumnTypes is set to True.

For example, with the schema hints:
SCHEMA_HINTS = [
'aaa TIMESTAMP',
'bbb.ccc INT']

If the first data file contains:
{
   "aaa": "2020-09-22T00:00:00Z",
   "bbb": {
      "ccc": 1234
   },
   "ddd": "blabla"
}

Then ddd is added to the schema seamlessly.

However, if the first data file is missing fields within the struct, like so:
{
   "aaa": "2020-09-22T00:00:00Z",
   "ddd": "blabla"
}

Then an error occurs:
Couldn't find column bbb in:
root
|-- aaa: timestamp (nullable = true)
|-- ddd: string (nullable = true)

Why doesn't Auto Loader add these fields to the schema in this case?
Is there a solution to ensure it does?
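
For reference, here is a minimal sketch of how hints like the ones above can be handed to Auto Loader. The helper name `build_autoloader_options` and both paths are hypothetical placeholders, and the actual read call is left commented out because it needs a Databricks runtime:

```python
def build_autoloader_options(schema_hints, schema_location):
    """Assemble Auto Loader options as a plain dict (hypothetical helper)."""
    return {
        "cloudFiles.format": "json",
        "cloudFiles.inferColumnTypes": "true",
        # cloudFiles.schemaHints takes a single comma-separated string.
        "cloudFiles.schemaHints": ", ".join(schema_hints),
        "cloudFiles.schemaLocation": schema_location,
    }


SCHEMA_HINTS = ["aaa TIMESTAMP", "bbb.ccc INT"]
options = build_autoloader_options(SCHEMA_HINTS, "/tmp/example/_schema")

# On a Databricks cluster the dict would be used roughly like this
# (the source path is a placeholder):
# df = (spark.readStream.format("cloudFiles")
#       .options(**options)
#       .load("/tmp/example/source"))
```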

Thank you!

my_super_name · 04-18-2024

Hi @Retired_mod

Thanks for your help!
Your solution works for the initial issue,
and I've implemented it in my code.

However, it creates another problem.
When we explicitly define the struct hint as 'bbb STRUCT<ccc: INT>',
it works until someone adds more fields to 'bbb'.

For example, with this data:
```python
data_file = [{"aaa": "2020-09-22T00:00:00Z", "bbb": {"ccc": 1234, "eee": "blabla"}, "ddd": "blabla"}]
```
Using these Schema Hints:
```python
SCHEMA_HINTS = [
'aaa TIMESTAMP',
'bbb STRUCT<ccc: INT>',
'ddd STRING'
]
```
We get an error because it can't handle additional fields in 'bbb' that are not specified in the hint:
```
org.apache.spark.sql.catalyst.util.UnknownFieldException:
[UNKNOWN_FIELD_EXCEPTION.NEW_FIELDS_IN_RECORD_WITH_FILE_PATH] Encountered unknown fields during parsing: {"bbb":{"eee":"blabla"}}, which can be fixed by an automatic retry: false
```
The original Schema Hints we started with:
```python
SCHEMA_HINTS = [
'aaa TIMESTAMP',
'bbb.ccc INT'
]
```
do not have this problem and will add 'eee' if it exists in the data.

Currently, to work around the issue,
we've implemented a temporary solution.
We generate an initial data file that includes all nested fields specified in the Schema Hints,
such as 'bbb', and always write it to our source directory.
This file is then discarded after schema creation.
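
A rough sketch of how such a seed record could be generated from the dotted hints is below. The function name `seed_record_from_hints` and the `PLACEHOLDERS` table are made up for illustration, it only handles simple `path TYPE` hints (not `STRUCT<...>` ones), and I'm assuming that a placeholder value per field is enough for the struct to be inferred:

```python
import json

# Placeholder value per hinted type. This mapping is an assumption;
# types not listed here fall back to None (JSON null).
PLACEHOLDERS = {
    "INT": 0,
    "STRING": "",
    "TIMESTAMP": "1970-01-01T00:00:00Z",
}


def seed_record_from_hints(schema_hints):
    """Build one nested record containing every field named in dotted hints."""
    record = {}
    for hint in schema_hints:
        # Split "bbb.ccc INT" into the dotted path and the type name.
        path, _, type_name = hint.strip().partition(" ")
        *parents, leaf = path.split(".")
        node = record
        # Create each parent struct on the way down if it is missing.
        for parent in parents:
            node = node.setdefault(parent, {})
        node[leaf] = PLACEHOLDERS.get(type_name.strip().upper())
    return record


SCHEMA_HINTS = ["aaa TIMESTAMP", "bbb.ccc INT"]
seed = seed_record_from_hints(SCHEMA_HINTS)
seed_json = json.dumps(seed)  # one JSON line to write as the seed file
```

Writing `seed_json` to the source directory before the first run, and removing it afterwards, would correspond to the workaround described above.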

However, I'd love to hear if there's a better solution that addresses the problem more elegantly.
Thank you very much!

Auto Loader Schema Hint Behavior: Addressing Nested Field Errors