10 hours ago - last edited 10 hours ago
Hi Databricks Community!
This is my first post in this forum, so I hope you'll forgive me if it doesn't follow the forum's best practices 🙂
After lots of searching, I decided to share the peculiar issue I'm running into with this community.
I'm trying to load JSON exposed by the Home Assistant /api/history/period/ endpoint. The format consists of an array of arrays:
[
  [
    {
      "entity_id": "sensor.solaredge_lifetime_energy",
      "state": "19848848.0",
      "attributes": {
        "state_class": "total",
        "unit_of_measurement": "Wh",
        "device_class": "energy",
        "friendly_name": "solaredge Lifetime energy"
      },
      "last_changed": "2025-12-14T23:00:00+00:00",
      "last_updated": "2025-12-14T23:00:00+00:00"
    },
    { ... }
  ],
  [ ... ]
]
Each inner array contains the measurements of one specific sensor, and every object has the same five properties/fields.
The file(s) live in S3, and I'm trying to read them as follows:
from pyspark.sql.types import StructType, StructField, StringType, MapType, TimestampType

schema = StructType([
    StructField("entity_id", StringType(), False),
    StructField("state", StringType(), True),
    StructField("attributes", MapType(StringType(), StringType()), True),
    StructField("last_changed", TimestampType(), False),
    StructField("last_updated", TimestampType(), False),
])

spark.read.format('json').option("multiLine", "true").schema(schema).load(<S3 location>).limit(10).display()
For some strange reason the result is one row per file with all null values, even though the correct columns are there.
I have already tried and checked a number of different things so far.
Any suggestions? I don't think I should have to transform the source JSON just to be able to load it with Spark.
9 hours ago
Greetings @Joost1024, I did some digging.
You’re running into a root type mismatch.
Your JSON’s top level is an array of arrays, but the schema you provided describes a single struct (one record). Spark can’t reconcile those two shapes, so it does what it always does in this situation: it gives you one row per file and fills the struct fields with nulls.
What’s happening
The file’s root is an array of arrays of structs — think [[{...}, {...}], [...]]. Your schema, however, describes only the inner object, not the outer container. When multiLine JSON is enabled, Spark treats each file as a single JSON value. Since the root type doesn’t match the schema, Spark can’t align the fields and you end up with nulls across the board.
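If you want to reproduce the behaviour in isolation, here is a minimal sketch. The scratch path and the tiny sample are illustrative only (not taken from your setup), and on a multi-node cluster you'd want to write to a shared location such as DBFS instead of the driver's local disk:

import json
from pyspark.sql import types as T

# Illustrative reproduction: write a tiny array-of-arrays JSON file, then read
# it with a flat per-record schema, i.e. the same shape mismatch as above.
sample = [[{"entity_id": "sensor.solaredge_lifetime_energy",
            "state": "19848848.0",
            "last_changed": "2025-12-14T23:00:00+00:00"}]]

path = "/tmp/history_sample.json"   # hypothetical scratch location on the driver
with open(path, "w") as f:
    json.dump(sample, f)

flat = T.StructType([
    T.StructField("entity_id", T.StringType()),
    T.StructField("state", T.StringType()),
])

# The file's root is an array of arrays, not an object, so Spark can't align
# the flat schema with it; this reproduces the all-null rows described above.
spark.read.option("multiLine", "true").schema(flat).json(f"file://{path}").show()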
The good news: you don’t need to change the source JSON. There are two clean ways to handle this directly in Spark.
Option A: Let Spark infer the nested arrays, then explode twice
Have Spark read the file as-is, infer the shape, and flatten it step by step:
from pyspark.sql import functions as F

df0 = (spark.read.format("json")
       .option("multiLine", "true")
       .load("<S3 location>"))

# df0 has a single column `value`: array<array<struct<entity_id,...>>>
df = (df0
      .select(F.explode("value").alias("arr"))   # array<struct>
      .select(F.explode("arr").alias("row"))     # struct
      .select(
          "row.entity_id",
          F.col("row.state").alias("state"),
          "row.attributes",
          F.to_timestamp(
              "row.last_changed",
              "yyyy-MM-dd'T'HH:mm:ssXXX"
          ).alias("last_changed"),
          F.to_timestamp(
              "row.last_updated",
              "yyyy-MM-dd'T'HH:mm:ssXXX"
          ).alias("last_updated"),
      ))

display(df.limit(10))
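Quick sanity check: before the explodes it's worth confirming what Spark actually inferred. The `value` column name above is what I'd expect for a non-object root, but printSchema will show the real shape, so adjust the explodes if it differs:

# Inspect the inferred top-level container and a first raw row.
df0.printSchema()
df0.limit(1).show(truncate=False)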
A couple of notes (a quick sketch follows below):
- The timestamp pattern yyyy-MM-dd'T'HH:mm:ssXXX correctly handles ISO-8601 offsets like +00:00.
- If state should be numeric, just cast it after the explode.
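For instance, a small sketch of both notes, where df is the flattened frame from Option A and the literal timestamp is only there to check the pattern:

from pyspark.sql import functions as F

# Illustrative only: cast `state` to a numeric type after the explode and
# verify the pattern against a literal offset timestamp.
df_checked = df.withColumn("state", F.col("state").cast("double"))
df_checked.select(
    F.to_timestamp(F.lit("2025-12-14T23:00:00+00:00"),
                   "yyyy-MM-dd'T'HH:mm:ssXXX").alias("parsed"),
    "state",
).show(3, truncate=False)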
Option B: Define a schema that actually matches the root
Instead of fighting the JSON shape, describe it accurately: a single top-level field that is an array of arrays of structs.
from pyspark.sql import functions as F, types as T

inner = T.StructType([
    T.StructField("entity_id", T.StringType(), False),
    T.StructField("state", T.StringType(), True),
    T.StructField("attributes", T.MapType(T.StringType(), T.StringType()), True),
    T.StructField("last_changed", T.StringType(), False),
    T.StructField("last_updated", T.StringType(), False),
])

schema = T.StructType([
    T.StructField("value", T.ArrayType(T.ArrayType(inner)), True)
])

df0 = (spark.read.format("json")
       .option("multiLine", "true")
       # .option("primitivesAsString", "true")  # optional, see notes below
       .schema(schema)
       .load("<S3 location>"))

df = (df0
      .select(F.explode("value").alias("arr"))
      .select(F.explode("arr").alias("row"))
      .select(
          "row.entity_id",
          F.col("row.state").alias("state"),
          "row.attributes",
          F.to_timestamp(
              "row.last_changed",
              "yyyy-MM-dd'T'HH:mm:ssXXX"
          ).alias("last_changed"),
          F.to_timestamp(
              "row.last_updated",
              "yyyy-MM-dd'T'HH:mm:ssXXX"
          ).alias("last_updated"),
      ))

display(df.limit(10))
Extra tips worth knowing
If attributes can contain numbers or booleans, you have a couple of safe options (see the sketch after this list):
- Use .option("primitivesAsString", "true") so everything lands as strings and nothing silently becomes null.
- Or widen the map type and normalize downstream once the data is flattened.
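Here's a sketch of the second option, assuming the flattened df from either option above. The attribute keys come from your sample, and note that, as far as I know, primitivesAsString only affects inferred schemas, so with the explicit map<string,string> schema in Option B the map values are already strings:

from pyspark.sql import functions as F

# Pull typed values out of the string map downstream; cast only the fields you
# know are numeric.
df_norm = (df
    .withColumn("unit", F.col("attributes").getItem("unit_of_measurement"))
    .withColumn("state_class", F.col("attributes").getItem("state_class"))
    .withColumn("state_num", F.col("state").cast("double")))
display(df_norm.limit(10))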
Also worth calling out: your original approach worked once you flattened the JSON externally because you removed that extra array level. The double-explode here is doing the same thing, just inside Spark where it belongs.
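If you prefer a single explode, F.flatten collapses the outer array first; this is equivalent to the double explode, with df0 read as in either option:

from pyspark.sql import functions as F

# flatten: array<array<struct>> -> array<struct>, then one explode and expand
# the struct into columns.
df_alt = (df0
    .select(F.explode(F.flatten("value")).alias("row"))
    .select("row.*"))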
Hope this helps, Louis.
8 hours ago
Thank you so much for your extensive explanation @Louis_Frolio! Now it makes complete sense.
8 hours ago
I guess I was a bit over-enthusiastic in accepting the answer.
When I run the following on the array of arrays with a single object (as shown in the original post), I get a single row with a "value" column whose value is null.
from pyspark.sql import functions as F, types as T

inner = T.StructType([
    T.StructField("entity_id", T.StringType(), False),
    T.StructField("state", T.StringType(), True),
    T.StructField("attributes", T.MapType(T.StringType(), T.StringType()), True),
    T.StructField("last_changed", T.StringType(), False),
    T.StructField("last_updated", T.StringType(), False),
])

schema = T.StructType([
    T.StructField("value", T.ArrayType(T.ArrayType(inner)), True)
])

df0 = (spark.read.format("json")
       .option("multiLine", "true")
       .option("primitivesAsString", "true")
       .schema(schema)
       .load("<S3 path>/original-single-item.json"))

display(df0)