Autoloader inferring struct as a string when reading JSON data
02-20-2025 09:23 AM - edited 02-20-2025 09:27 AM
Hi Everyone,
Reading JSON files with Auto Loader fails to infer the schema correctly: every nested/struct column is inferred as a string.
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", CHECKPOINT_PATH)
    .option("multiLine", True)
    .load(f"/Volumes/{CATALOG}/{SCHEMA}/files/"))
When I read the same files with a regular batch read, Spark infers the schema correctly.
spark.read.format("json").option("multiline", True).load(f"/Volumes/{CATALOG}/{SCHEMA}/files/")
I've also deleted the checkpoint to rule that out, but the result is the same.
Here are the schemas to compare
Autoloader:
root
|-- changelog: string (nullable = true)
|-- issue: string (nullable = true)
|-- issue_event_type_name: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- user: string (nullable = true)
|-- webhookEvent: string (nullable = true)
|-- project: string (nullable = true)
|-- year: string (nullable = true)
|-- month: string (nullable = true)
|-- day: string (nullable = true)
|-- issue_id: string (nullable = true)
|-- _rescued_data: string (nullable = true)
Spark Normal:
root
|-- changelog: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- items: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- field: string (nullable = true)
| | | |-- fieldId: string (nullable = true)
| | | |-- fieldtype: string (nullable = true)
| | | |-- from: string (nullable = true)
| | | |-- fromString: string (nullable = true)
| | | |-- tmpFromAccountId: string (nullable = true)
| | | |-- tmpToAccountId: string (nullable = true)
| | | |-- to: string (nullable = true)
| | | |-- toString: string (nullable = true)
|-- issue: struct (nullable = true)
| |-- fields: struct (nullable = true)
| | |-- assignee: string (nullable = true)
| | |-- attachment: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- components: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- created: string (nullable = true)
| | |-- creator: struct (nullable = true)
| | | |-- accountId: string (nullable = true)
| | | |-- accountType: string (nullable = true)
| | | |-- active: boolean (nullable = true)
| | | |-- avatarUrls: struct (nullable = true)
| | | | |-- 16x16: string (nullable = true)
...
02-20-2025 07:55 PM
Hi @robertomatus ,
As per my understanding, it looks like Auto Loader isn't inferring the nested structures because it handles schema inference differently from spark.read.json(). For JSON (and CSV) sources, Auto Loader infers every column as a string by default; setting .option("cloudFiles.inferColumnTypes", "true") tells it to infer the actual types, including structs and arrays. You can also define the schema explicitly with .schema() to guarantee the structs are recognized, and if your schema changes over time, .option("cloudFiles.schemaEvolutionMode", "rescue") captures unexpected columns in _rescued_data instead of failing the stream.
Alternatively, you could run spark.read.json() on the files once, capture the schema it infers, and pass that schema to Auto Loader.
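For example, something along these lines should give you structs instead of strings (a rough sketch using your own placeholders CHECKPOINT_PATH, CATALOG and SCHEMA; you may also need to clear the previously inferred schema stored under the schema location so it gets re-inferred):
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", CHECKPOINT_PATH)
    # Infer real column types (structs, arrays, numbers) instead of defaulting everything to string
    .option("cloudFiles.inferColumnTypes", "true")
    # Send unexpected/new columns to _rescued_data instead of failing the stream
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .option("multiLine", True)
    .load(f"/Volumes/{CATALOG}/{SCHEMA}/files/"))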
Hope this helps! Let me know if you need more details.
Regards,
Brahma
02-21-2025 01:01 AM
Hi @Brahmareddy
Thank you for your answer. I found a way to get the schema from spark.read.json() and pass it to Auto Loader, which works, but it would be better if we didn't have to resort to this kind of workaround (roughly the pattern shown below).
If Auto Loader is basically just Spark Structured Streaming under the hood, why does it infer the schema differently?
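For reference, the workaround looks roughly like this (same placeholders as in my first post):
# One-off batch read just to capture the full nested schema
inferred_schema = (spark.read.format("json")
    .option("multiLine", True)
    .load(f"/Volumes/{CATALOG}/{SCHEMA}/files/")
    .schema)

# Pass that schema to Auto Loader explicitly instead of letting it infer one
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", CHECKPOINT_PATH)
    .option("multiLine", True)
    .schema(inferred_schema)
    .load(f"/Volumes/{CATALOG}/{SCHEMA}/files/"))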
02-21-2025 07:57 AM
Hi @robertomatus ,
You're right, it would be much better if we didn't have to rely on workarounds. The reason Auto Loader infers the schema differently from spark.read.json() is that it's optimized for streaming large-scale data efficiently: rather than scanning all files, it samples a subset of the data to infer the schema, and for JSON it defaults every column to string (unless cloudFiles.inferColumnTypes is enabled) so that type mismatches don't break incremental schema evolution as new files arrive.
If you want a more reliable approach, define the schema manually and pass it to Auto Loader instead of relying on inference, for example by running spark.read.json() on a small sample once and extracting the schema, as you did. You can also enable .option("cloudFiles.schemaEvolutionMode", "rescue") to handle unexpected changes dynamically.
While it would be great if Auto Loader handled this seamlessly, these steps make schema inference more predictable and reduce inconsistencies.
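On the sampling point, if you want Auto Loader to look at more data before settling on a schema, these Spark configs control the sample size (config names as documented for Databricks Auto Loader; defaults and accepted value formats may differ by runtime, so please verify on your version):
# Raise how much data Auto Loader samples during schema inference (byte string / file count)
spark.conf.set("spark.databricks.cloudFiles.schemaInference.sampleSize.numBytes", "10gb")
spark.conf.set("spark.databricks.cloudFiles.schemaInference.sampleSize.numFiles", "2000")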
Hoping you have a good day.
Regards,
Brahma

