cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Autoloader: struct field inferred as string

yit
Contributor

We are currently implementing Autoloader for JSON files with nested struct fields. The goal is to detect the fields as structs, and to have schema evolution.

The schema evolution mode is set to addNewColumns, and inferColumnTypes option is set to true to detect the real types of the fields instead of making them all strings. 

One of the fields is deeply nested struct. There are some empty files in the folder as well. The problem is that Autoloader infers the field as string. 

What could be the issue? 

I cannot use schemaHints to define the field as struct, because not even one field exists in all files, while schemaHints expects to define at least one field for struct types.

1 ACCEPTED SOLUTION

Accepted Solutions

yit
Contributor

It is possible to define empty struct as column type through schema hints, but it won't do schema evolution if subfields appear in the data for that column. 

Conclusion when working with JSON files and 'addNewColumns' as schema evolution mode: 

  • You can give partial schema through schema hints.
  • If you give schema hint for some field which is of type struct, you must provide the full schema of the struct. Otherwise it will raise an error, as schema evolution does not apply for fields for which we have defined schema hints.
  • If a column has data from both struct and array types, it will infer it as string - the most generic type to represent both struct and arrays.

View solution in original post

7 REPLIES 7

szymon_dybczak
Esteemed Contributor III

Hi @yit ,

This is expected and documented behaviour of autloader schema inference:

szymon_dybczak_0-1757598803515.png

Configure schema inference and evolution in Auto Loader | Databricks on AWS

@szymon_dybczak Not if you set the option inferColumnTypes to true. 

yit_0-1757606885202.png

 

szymon_dybczak
Esteemed Contributor III

Thanks, good to know about this option. But if inferColumnTypes options doesn't work in this case, maybe you can try new VARIANT type instead? 

@szymon_dybczak Struct has its own benefits over variant. It's more memory efficient, and it shows the full nested schema in Overview in Unity Catalog, while the variant type just shows the name of the first level field. Basically you can navigate through it very easily.

And many other limitations described in https://docs.databricks.com/aws/en/delta/variant#limitations .

szymon_dybczak
Esteemed Contributor III

Yep, it does have a limitation. But did you see some benchmark that compared memory efficiency between variant vs struct? I'm quite sure that in some marketing materials databricks claimed that variant should be much more efficient - especially working with JSON files. So, it's quite interesting. Maybe that's another example to conduct own benchmarks and to always double check marketing claims 😄

@szymon_dybczak I did not conduct memory efficiency benchmarking, as it's not the main evaluation we seek for. Considering the current scenario, we definitely want to use struct.
The problem that should be solved is how to define an empty struct in schema hints (it raises an error that at least one field in the struct should be defined)?

yit
Contributor

It is possible to define empty struct as column type through schema hints, but it won't do schema evolution if subfields appear in the data for that column. 

Conclusion when working with JSON files and 'addNewColumns' as schema evolution mode: 

  • You can give partial schema through schema hints.
  • If you give schema hint for some field which is of type struct, you must provide the full schema of the struct. Otherwise it will raise an error, as schema evolution does not apply for fields for which we have defined schema hints.
  • If a column has data from both struct and array types, it will infer it as string - the most generic type to represent both struct and arrays.

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!

Sign Up Now