Joost1024

I guess I was a bit overenthusiastic in accepting the answer.

When I run the following on the single-object array of arrays (as shown in the original post), I get a single row with one column, "value", whose value is null.

from pyspark.sql import functions as F, types as T

# Schema of one entry in the inner arrays
inner = T.StructType([
    T.StructField("entity_id", T.StringType(), False),
    T.StructField("state", T.StringType(), True),
    T.StructField("attributes", T.MapType(T.StringType(), T.StringType()), True),
    T.StructField("last_changed", T.StringType(), False),
    T.StructField("last_updated", T.StringType(), False),
])

# Top-level schema: a single "value" field holding an array of arrays of entries
schema = T.StructType([
    T.StructField("value", T.ArrayType(T.ArrayType(inner)), True)
])

df0 = (
    spark.read.format("json")
    .option("multiLine", "true")
    .option("primitivesAsString", "true")
    .schema(schema)
    .load("<S3 path>/original-single-item.json")
)

display(df0)
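
To narrow this down outside Spark, a quick check of the file's top-level shape might help. This is only a minimal sketch using the standard json module: the helper name describe_top_level and the sample payload are made up for illustration and are not taken from the original file. One possible explanation for the null is that the schema above expects a top-level object with a "value" key, while the file itself starts with an array.

```python
import json


def describe_top_level(text: str) -> str:
    """Return a short description of the top-level JSON shape (hypothetical helper)."""
    data = json.loads(text)
    if isinstance(data, dict):
        # A top-level object: list its keys so they can be compared with the schema.
        return "object with keys: " + ", ".join(sorted(data))
    if isinstance(data, list):
        # A non-empty top-level array whose elements are all arrays.
        if data and all(isinstance(item, list) for item in data):
            return "array of arrays"
        return "array"
    return type(data).__name__


# Made-up sample that mimics the assumed shape of the file; the values are not real data.
sample = (
    '[[{"entity_id": "sensor.demo", "state": "1", "attributes": {}, '
    '"last_changed": "t0", "last_updated": "t0"}]]'
)
print(describe_top_level(sample))
```

If the real file reports an array of arrays rather than an object with a "value" key, the wrapper struct in the schema would not line up with the data, which could be why the read yields null.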