cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Auto Loader Schema Hint Behavior: Addressing Nested Field Errors

my_super_name
New Contributor II

Hello,
I'm using the auto loader to stream a table of data and have added schema hints to specify field values.
I've observed that when my initial data file is missing fields specified in the schema hint,
the auto loader correctly identifies this and adds them to the schema.

However, if these missing fields are nested within a struct, it throws an error stating "Couldn't find column example in:",
despite setting the attribute cloudFiles.inferColumnTypes = True.

For example, with the schema hints:
SCHEMA_HINTS = [
     'aaa TIMESTAMP',
     'bbb.ccc INT']

If the first data file contains:
{
   "aaa": "2020-09-22T00:00:00Z",
   "bbb": {
      "ccc": 1234
},
   "ddd": "blabla"
}

Then ddd is added to the schema seamlessly.

However, if the first data file is missing fields within the struct, like so:
{
   "aaa": "2020-09-22T00:00:00Z",
   "ddd": "blabla"
}

Then an error occurs:
Couldn't find column bbb in:
root
|-- aaa: timestamp (nullable = true)
|-- ddd: string (nullable = true)

Why doesn't the auto loader add these fields to the schema in this case?
Is there a solution to ensure it does?

Thank you!

2 REPLIES 2

my_super_name
New Contributor II


Hi @Retired_mod 

Thanks for your help!
Your solution works for the initial issue,
and I've implemented it first in my code.

but it creates a other problem.
When we explicitly define the struct hint as 'bbb STRUCT<ccc: INT>',
it works until someone adds more fields to 'bbb'.

For example, with this data:
```python
data_file = [{"aaa": "2020-09-22T00:00:00Z", "bbb": {"ccc": 1234, "eee": "blabla"}, "ddd": "blabla"}]
```
Using these Schema Hints:
```python
SCHEMA_HINTS = [
'aaa TIMESTAMP',
'bbb STRUCT<ccc: INT>',
'ddd STRING'
]
```
We get an error because it can't handle additional fields in 'bbb' that are not specified in the hint:
org.apache.spark.sql.catalyst.util.UnknownFieldException:
```python
[UNKNOWN_FIELD_EXCEPTION.NEW_FIELDS_IN_RECORD_WITH_FILE_PATH] Encountered unknown fields during parsing: {"bbb":{"eee":"blabla"}}, which can be fixed by an automatic retry: false
```
The original Schema Hints we started with:
```python
SCHEMA_HINTS = [
'aaa TIMESTAMP',
'bbb.ccc INT'
]
```
do not have this problem and will add 'eee' if it exists in the data.

Currently, to work around the issue,
we've implemented a temporary solution.
We generate an initial data file that includes all nested fields specified in the Schema Hints,
such as 'bbb', and always write it to our source directory.
This file is then discarded after schema creation.

However, I'd love to hear if there's a better solution that addresses the problem more elegantly.
Thank you very much!

Mathias_Peters
Contributor

Hi, 
we are having similar issues with schema hints formulated in fully qualified DDL, e.g. "a STRUCT<b INT>" etc. Did you find a solution? 


Also, did you specify the schema hint using the dot-notation, e.g. "a.b INT" before ingesting any data or after (i.e. as a type override)? 

I keep getting errors such as this: 
The following is not a valid schema hint:

data.id STRING, metadata.timestamp STRING,...

Use either a column descriptor syntax that overrides a leaf column (e.g. `a.element.b LONG`), or a fully qualified DDL that merges with the inferred schema completely (e.g. `STRUCT<a ARRAY<STRUCT<b LONG>>>` SQLSTATE: 42601

Thank you!

 

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group