Autoloader inferring struct as a string when reading JSON data

robertomatus — Thu, 20 Feb 2025 17:27:18 GMT

Hi Everyone,

I'm trying to read JSON files with Auto Loader, but it fails to infer the schema correctly: every nested (struct) column is inferred as a string.

(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", CHECKPOINT_PATH)
    .option("multiLine", True)
    .load(f"/Volumes/{CATALOG}/{SCHEMA}/files/")
)

When I read the same files with a regular batch read, Spark infers the schema correctly.

spark.read.format("json").option("multiline", True).load(f"/Volumes/{CATALOG}/{SCHEMA}/files/")

I also deleted the checkpoint to see whether it was causing the problem, but the result is still the same.

Here are the schemas to compare

Autoloader:

Spark Normal:

Re: Autoloader inferring struct as a string when reading JSON data

Brahmareddy — Fri, 21 Feb 2025 03:55:23 GMT

Hi @robertomatus ,

As I understand it, Auto Loader isn't inferring the nested structures correctly, likely because it handles schema inference differently from spark.read.json().

You can try defining the schema explicitly with .schema() so that structs are recognized properly. To have Auto Loader infer actual column types rather than strings, set .option("cloudFiles.inferColumnTypes", "true"). If your schema changes over time, .option("cloudFiles.schemaEvolutionMode", "rescue") might also help: it keeps the schema fixed and records unexpected fields in the rescued data column.
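To make the cloudFiles.inferColumnTypes suggestion concrete, here is a minimal sketch. The helper names (autoloader_json_options, read_json_stream) are made up for illustration, and it assumes a Databricks runtime where `spark` and the cloudFiles source are available. For JSON input, Auto Loader keeps every column as a string unless this option is set.

```python
def autoloader_json_options(schema_location, infer_column_types=True):
    """Build Auto Loader options for multi-line JSON input (hypothetical helper)."""
    return {
        "cloudFiles.format": "json",
        "cloudFiles.schemaLocation": schema_location,
        # Off by default: without it, JSON columns (nested ones included)
        # are inferred as strings.
        "cloudFiles.inferColumnTypes": str(infer_column_types).lower(),
        "multiLine": "true",
    }


def read_json_stream(spark, source_path, schema_location):
    """Return a streaming DataFrame that reads JSON files with typed inference."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_json_options(schema_location))
        .load(source_path)
    )
```

Calling read_json_stream(spark, f"/Volumes/{CATALOG}/{SCHEMA}/files/", CHECKPOINT_PATH).printSchema() should then show struct columns instead of strings. A schema already stored under the schema location may be reused, so pointing at a fresh location while testing is probably safer.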

Alternatively, pre-processing your JSON files with spark.read.json() before using Auto Loader could ensure the correct structure.

Hope this helps! Let me know if you need more details.

Regards,

Brahma

Re: Autoloader inferring struct as a string when reading JSON data

robertomatus — Fri, 21 Feb 2025 09:01:45 GMT

Hi @Brahmareddy

Thank you for your answer. I found some ways to get the schema from spark.read.json() and pass it to Auto Loader, which works, but it would be better if we didn't have to resort to these kinds of workarounds.

If Auto Loader is basically just Spark Structured Streaming, why does it infer the schema differently?

Re: Autoloader inferring struct as a string when reading JSON data

Brahmareddy — Fri, 21 Feb 2025 15:57:43 GMT

Hi @robertomatus ,

You're right, it would be much better if we didn't have to rely on workarounds. Auto Loader infers the schema differently from spark.read.json() because it is designed for incremental ingestion of large-scale data. Unlike spark.read.json(), which scans all files, Auto Loader samples the data to infer the schema faster, and for JSON it defaults to string columns so that type mismatches don't break the stream later on. It also supports incremental schema evolution for handling new columns over time.

If you want a more reliable approach, consider defining the schema manually and passing it to Auto Loader instead of relying on inference. Another option is to run spark.read.json() once on a small sample, extract the schema, and then provide it to Auto Loader. You can also set .option("cloudFiles.schemaEvolutionMode", "rescue") so that unexpected fields are captured in the rescued data column instead of failing the stream.
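The sample-once approach could look roughly like this. It is a sketch rather than a definitive implementation: the function names and the sample_path argument are hypothetical, and it assumes the sample files are representative of the full data set.

```python
def infer_schema_from_sample(spark, sample_path):
    """Infer the schema once with the batch JSON reader.

    The batch reader scans every file under sample_path, so keep the sample small.
    """
    return (
        spark.read.format("json")
        .option("multiLine", True)
        .load(sample_path)
        .schema
    )


def read_stream_with_schema(spark, source_path, schema):
    """Start an Auto Loader read with an explicit schema instead of inference."""
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("multiLine", True)
        .schema(schema)  # a StructType, or a DDL string such as "id STRING"
        .load(source_path)
    )
```

In a notebook this would be used as schema = infer_schema_from_sample(spark, sample_path) followed by read_stream_with_schema(spark, source_path, schema). Fields that never appear in the sample will not be part of the schema, so the sample needs to cover every nested structure you care about.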

While it would be great if AutoLoader handled this seamlessly, these steps can help make schema inference more predictable and reduce inconsistencies.

Hoping you have a good day.

Regards,

Brahma