Thursday
We are currently implementing Autoloader for JSON files with nested struct fields. The goal is to detect the fields as structs, and to have schema evolution.
The schema evolution mode is set to addNewColumns, and inferColumnTypes option is set to true to detect the real types of the fields instead of making them all strings.
One of the fields is deeply nested struct. There are some empty files in the folder as well. The problem is that Autoloader infers the field as string.
What could be the issue?
I cannot use schemaHints to define the field as struct, because not even one field exists in all files, while schemaHints expects to define at least one field for struct types.
yesterday
It is possible to define empty struct as column type through schema hints, but it won't do schema evolution if subfields appear in the data for that column.
Conclusion when working with JSON files and 'addNewColumns' as schema evolution mode:
Thursday
Hi @yit ,
This is expected and documented behaviour of autloader schema inference:
Configure schema inference and evolution in Auto Loader | Databricks on AWS
Thursday
Thursday - last edited Thursday
Thanks, good to know about this option. But if inferColumnTypes options doesn't work in this case, maybe you can try new VARIANT type instead?
Thursday
@szymon_dybczak Struct has its own benefits over variant. It's more memory efficient, and it shows the full nested schema in Overview in Unity Catalog, while the variant type just shows the name of the first level field. Basically you can navigate through it very easily.
And many other limitations described in https://docs.databricks.com/aws/en/delta/variant#limitations .
Thursday
Yep, it does have a limitation. But did you see some benchmark that compared memory efficiency between variant vs struct? I'm quite sure that in some marketing materials databricks claimed that variant should be much more efficient - especially working with JSON files. So, it's quite interesting. Maybe that's another example to conduct own benchmarks and to always double check marketing claims 😄
Monday
@szymon_dybczak I did not conduct memory efficiency benchmarking, as it's not the main evaluation we seek for. Considering the current scenario, we definitely want to use struct.
The problem that should be solved is how to define an empty struct in schema hints (it raises an error that at least one field in the struct should be defined)?
yesterday
It is possible to define empty struct as column type through schema hints, but it won't do schema evolution if subfields appear in the data for that column.
Conclusion when working with JSON files and 'addNewColumns' as schema evolution mode:
Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!
Sign Up Now