Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Read Array of Arrays of Objects JSON file using Spark

Joost1024
Visitor

Hi Databricks Community! 

This is my first post in this forum, so I hope you can forgive me if it's not according to the forum best practices 🙂

After lots of searching, I decided to share with this community the peculiar issue I'm running into.

I'm trying to load the JSON format exposed by the Home Assistant /api/history/period/ endpoint. The format consists of an array of arrays:

[
  [
    {
      "entity_id": "sensor.solaredge_lifetime_energy",
      "state": "19848848.0",
      "attributes": {
        "state_class": "total",
        "unit_of_measurement": "Wh",
        "device_class": "energy",
        "friendly_name": "solaredge Lifetime energy"
      },
      "last_changed": "2025-12-14T23:00:00+00:00",
      "last_updated": "2025-12-14T23:00:00+00:00"
    },
    { ... }
  ],
  [ ... ]
]

Each inner array contains the measurements of a specific sensor, and every object has the same 5 properties / fields.

The file(s) live in S3 and I try to read them as follows:

from pyspark.sql.types import StructType, StructField, StringType, MapType, TimestampType

schema = StructType([
    StructField("entity_id", StringType(), False),
    StructField("state", StringType(), True),
    StructField("attributes", MapType(StringType(), StringType()), True),
    StructField("last_changed", TimestampType(), False),
    StructField("last_updated", TimestampType(), False),
])

spark.read.format("json").option("multiLine", "true").schema(schema).load("<S3 location>").limit(10).display()

For some strange reason, the result is one row per file with all null values, even though the correct columns are there.

I tried / checked different things so far:

  1. Validated the JSON: The JSON file is valid and doesn't contain any malformed / missing values.
  2. I 'flattened' the list of lists to a single top-level list using flattened = [element for sub_list in data for element in sub_list]. The resulting JSON loads just fine and is exactly the output I want (a sketch of this workaround follows the list below).
  3. I tried removing the schema definition, but the result remains the same.
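
For reference, a minimal sketch of the external flattening workaround from step 2; the file names are placeholders for wherever a local copy of the export lives, not paths from the original post:

from pyspark.sql import SparkSession  # not needed for the flattening itself
import json

# Hypothetical local copy of the /api/history/period/ response.
with open("history_period.json") as f:
    data = json.load(f)  # list of lists of measurement objects

# Collapse the outer array into a single top-level list of objects.
flattened = [element for sub_list in data for element in sub_list]

# Write the flattened version back out so Spark can read it as a plain array of objects.
with open("history_period_flat.json", "w") as f:
    json.dump(flattened, f)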

Any suggestions? I don't think I should have to transform the source JSON just to be able to load it with Spark.

3 REPLIES

Louis_Frolio
Databricks Employee

Greetings @Joost1024, I did some digging.

You're running into a root type mismatch.

Your JSON's top level is an array of arrays, but the schema you provided describes a single struct (one record). Spark can't reconcile those two shapes, so it does what it always does in this situation: it gives you one row per file and fills the struct fields with nulls.

What's happening

The file's root is an array of arrays of structs, i.e. [[{...}, {...}], [...]]. Your schema, however, describes only the inner object, not the outer container. When multiLine JSON is enabled, Spark treats each file as a single JSON value. Since the root type doesn't match the schema, Spark can't align the fields and you end up with nulls across the board.
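
To see the mismatch in isolation, here is a minimal, self-contained reproduction sketch. It assumes a local or single-node Spark session where a driver-local temp file is readable; the sample payload and path are made up for illustration, and per the thread the read comes back as null columns:

import json, os, tempfile
from pyspark.sql.types import StructType, StructField, StringType

# A tiny file whose root is an array of arrays of objects, mimicking the Home Assistant export.
payload = [[{"entity_id": "sensor.a", "state": "1"}],
           [{"entity_id": "sensor.b", "state": "2"}]]
path = os.path.join(tempfile.mkdtemp(), "nested.json")
with open(path, "w") as f:
    json.dump(payload, f)

# Schema that describes only the inner object, as in the original post.
schema = StructType([
    StructField("entity_id", StringType(), False),
    StructField("state", StringType(), True),
])

# The root type (array of arrays) does not match the struct schema,
# so the columns come back null instead of being populated.
(spark.read.format("json")
    .option("multiLine", "true")
    .schema(schema)
    .load("file://" + path)
    .show())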

The good news: you don't need to change the source JSON. There are two clean ways to handle this directly in Spark.

Option A: Let Spark infer the nested arrays, then explode twice

Have Spark read the file as-is, infer the shape, and flatten it step by step:

from pyspark.sql import functions as F

df0 = (spark.read.format("json")
       .option("multiLine", "true")
       .load("<S3 location>"))

# df0 has a single column `value`: array<array<struct<entity_id,...>>>
df = (df0
      .select(F.explode("value").alias("arr"))   # array<struct>
      .select(F.explode("arr").alias("row"))     # struct
      .select(
          "row.entity_id",
          F.col("row.state").alias("state"),
          "row.attributes",
          F.to_timestamp(
              "row.last_changed",
              "yyyy-MM-dd'T'HH:mm:ssXXX"
          ).alias("last_changed"),
          F.to_timestamp(
              "row.last_updated",
              "yyyy-MM-dd'T'HH:mm:ssXXX"
          ).alias("last_updated"),
      ))

display(df.limit(10))

A couple of notes:

  • The timestamp pattern yyyy-MM-dd'T'HH:mm:ssXXX correctly handles ISO-8601 offsets like +00:00.

  • If state should be numeric, just cast it after the explode (a minimal sketch follows below).
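
A minimal sketch of that cast, reusing the df built above; the double type is an assumption, since state arrives as a string:

from pyspark.sql import functions as F

# Cast the string-typed sensor reading to a number after the double explode.
df_numeric = df.withColumn("state", F.col("state").cast("double"))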

Option B: Define a schema that actually matches the root

Instead of fighting the JSON shape, describe it accurately: a single top-level field that is an array of arrays of structs.

from pyspark.sql import functions as F, types as T

inner = T.StructType([
    T.StructField("entity_id", T.StringType(), False),
    T.StructField("state", T.StringType(), True),
    T.StructField("attributes", T.MapType(T.StringType(), T.StringType()), True),
    T.StructField("last_changed", T.StringType(), False),
    T.StructField("last_updated", T.StringType(), False),
])

schema = T.StructType([
    T.StructField("value", T.ArrayType(T.ArrayType(inner)), True)
])

df0 = (spark.read.format("json")
       .option("multiLine", "true")
       # .option("primitivesAsString", "true")  # optional, see notes below
       .schema(schema)
       .load("<S3 location>"))

df = (df0
      .select(F.explode("value").alias("arr"))
      .select(F.explode("arr").alias("row"))
      .select(
          "row.entity_id",
          F.col("row.state").alias("state"),
          "row.attributes",
          F.to_timestamp(
              "row.last_changed",
              "yyyy-MM-dd'T'HH:mm:ssXXX"
          ).alias("last_changed"),
          F.to_timestamp(
              "row.last_updated",
              "yyyy-MM-dd'T'HH:mm:ssXXX"
          ).alias("last_updated"),
      ))

display(df.limit(10))

Extra tips worth knowing

If attributes can contain numbers or booleans, you have a couple of safe options:

  • Use .option("primitivesAsString", "true") so everything lands as strings and nothing silently becomes null.

  • Or widen the map type and normalize downstream once the data is flattened (see the sketch after these tips).
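
One hedged reading of the "normalize downstream" half of that tip, reusing the flattened df from above where attributes is map<string,string>; the attribute key names come from the sample payload:

from pyspark.sql import functions as F

# Pull the attribute keys you care about out of the map column into their own columns,
# then cast or clean them as needed downstream.
df_attrs = (df
    .withColumn("unit_of_measurement", F.col("attributes")["unit_of_measurement"])
    .withColumn("device_class", F.col("attributes")["device_class"])
    .withColumn("friendly_name", F.col("attributes")["friendly_name"]))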

Also worth calling out: your original approach worked once you flattened the JSON externally because you removed that extra array level. The double-explode here is doing the same thing, just inside Spark where it belongs.

 

Hope this helps, Louis.

Thank you so much for your extensive explanation @Louis_Frolio! Now it makes complete sense.

Joost1024
Visitor

I guess I was a bit over-enthusiastic in accepting the answer.

When I run the following on the array of arrays containing a single object (as shown in the original post), I get a single row with a column "value" whose value is null.

 

from pyspark.sql import functions as F, types as T

inner = T.StructType([
    T.StructField("entity_id", T.StringType(), False),
    T.StructField("state", T.StringType(), True),
    T.StructField("attributes", T.MapType(T.StringType(), T.StringType()), True),
    T.StructField("last_changed", T.StringType(), False),
    T.StructField("last_updated", T.StringType(), False),
])

schema = T.StructType([
    T.StructField("value", T.ArrayType(T.ArrayType(inner)), True)
])

df0 = (spark.read.format("json")
       .option("multiLine", "true")
       .option("primitivesAsString", "true")
       .schema(schema)
       .load("<S3 path>/original-single-item.json"))

display(df0)