Louis_Frolio
Databricks Employee
Databricks Employee

Greetings @Joost1024 , I did some digging. 

You’re running into a root type mismatch.

Your JSON’s top level is an array of arrays, but the schema you provided describes a single struct (one record). Spark can’t reconcile those two shapes, so it does what it always does in this situation: it gives you one row per file and fills the struct fields with nulls.

What’s happening

The file’s root is an array of arrays of structs — think [[{...}, {...}], [...]]. Your schema, however, describes only the inner object, not the outer container. When multiLine JSON is enabled, Spark treats each file as a single JSON value. Since the root type doesn’t match the schema, Spark can’t align the fields and you end up with nulls across the board.

The good news: you don’t need to change the source JSON. There are two clean ways to handle this directly in Spark.

Option A: Let Spark infer the nested arrays, then explode twice

Have Spark read the file as-is, infer the shape, and flatten it step by step:

from pyspark.sql import functions as F

df0 = (spark.read.format("json")
       .option("multiLine", "true")
       .load("<S3 location>"))

# df0 has a single column `value`: array<array<struct<entity_id,...>>>
df = (df0
      .select(F.explode("value").alias("arr"))   # array<struct>
      .select(F.explode("arr").alias("row"))     # struct
      .select(
          "row.entity_id",
          F.col("row.state").alias("state"),
          "row.attributes",
          F.to_timestamp(
              "row.last_changed",
              "yyyy-MM-dd'T'HH:mm:ssXXX"
          ).alias("last_changed"),
          F.to_timestamp(
              "row.last_updated",
              "yyyy-MM-dd'T'HH:mm:ssXXX"
          ).alias("last_updated"),
      ))

display(df.limit(10))

A couple of notes:

  • The timestamp pattern yyyy-MM-dd'T'HH:mm:ssXXX correctly handles ISO-8601 offsets like +00:00.

  • If state should be numeric, just cast it after the explode.

Option B: Define a schema that actually matches the root

Instead of fighting the JSON shape, describe it accurately: a single top-level field that is an array of arrays of structs.

from pyspark.sql import functions as F, types as T

inner = T.StructType([
    T.StructField("entity_id", T.StringType(), False),
    T.StructField("state", T.StringType(), True),
    T.StructField("attributes", T.MapType(T.StringType(), T.StringType()), True),
    T.StructField("last_changed", T.StringType(), False),
    T.StructField("last_updated", T.StringType(), False),
])

schema = T.StructType([
    T.StructField("value", T.ArrayType(T.ArrayType(inner)), True)
])

df0 = (spark.read.format("json")
       .option("multiLine", "true")
       # .option("primitivesAsString", "true")  # optional, see notes below
       .schema(schema)
       .load("<S3 location>"))

df = (df0
      .select(F.explode("value").alias("arr"))
      .select(F.explode("arr").alias("row"))
      .select(
          "row.entity_id",
          F.col("row.state").alias("state"),
          "row.attributes",
          F.to_timestamp(
              "row.last_changed",
              "yyyy-MM-dd'T'HH:mm:ssXXX"
          ).alias("last_changed"),
          F.to_timestamp(
              "row.last_updated",
              "yyyy-MM-dd'T'HH:mm:ssXXX"
          ).alias("last_updated"),
      ))

display(df.limit(10))

Extra tips worth knowing

If attributes can contain numbers or booleans, you have a couple of safe options:

  • Use .option("primitivesAsString", "true") so everything lands as strings and nothing silently becomes null.

  • Or widen the map type and normalize downstream once the data is flattened.

Also worth calling out: your original approach worked once you flattened the JSON externally because you removed that extra array level. The double-explode here is doing the same thing, just inside Spark where it belongs.

 

Hope this helps, Louis.