Hi
I'm importing a large collection of JSON files. The problem is that they are not structured the way I would expect a well-formed JSON file to be (although they are probably still valid). Each file consists of a single record that looks something like this (this is just an abstraction):
[{"name":"myName","Surname":"MySurname"},[{"address":"1","Type":"Home"},{"address":"2","Type":"Home"}],[{"Tel":"1"},{"Tel":"2"}]]
Ideally, I would like to import it using the standard JSON read option, but I can't figure out how to structure the schema.
My first approach involved creating a UDF that took the record as a string and returned a new, properly structured object:
def structure(record):
    # record[0] is the person, record[1] the addresses, record[2] the telephones
    result = {}
    result["name"] = record[0]["name"]
    result["Surname"] = record[0]["Surname"]
    result["addresses"] = []
    result["telephones"] = []
    for address in record[1]:
        result["addresses"].append({"address": address["address"], "Type": address["Type"]})
    for telephone in record[2]:
        result["telephones"].append({"Tel": telephone["Tel"]})
    return result
This works, but it will be slower and less intuitive than reading the files with a schema directly.
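To make the intended result concrete, here is a minimal, Spark-free sketch of that restructuring applied to the sample record above. The name `restructure_json` is hypothetical (it is not part of my actual job), and the sketch assumes each file's content arrives as a single string that `json.loads` can parse.

```python
import json

# Sample record from above: a top-level array holding one object
# followed by two arrays of objects.
raw = (
    '[{"name":"myName","Surname":"MySurname"},'
    '[{"address":"1","Type":"Home"},{"address":"2","Type":"Home"}],'
    '[{"Tel":"1"},{"Tel":"2"}]]'
)


def restructure_json(text):
    """Parse one raw record and return it as a JSON object string.

    Hypothetical helper: assumes the top-level array always has exactly
    three elements (person, addresses, telephones), in that order.
    """
    person, addresses, telephones = json.loads(text)
    record = {
        "name": person["name"],
        "Surname": person["Surname"],
        "addresses": [{"address": a["address"], "Type": a["Type"]} for a in addresses],
        "telephones": [{"Tel": t["Tel"]} for t in telephones],
    }
    return json.dumps(record)


fixed = json.loads(restructure_json(raw))
print(fixed["addresses"][1]["address"])  # prints 2
```

The output is a single JSON object with named fields, which is the shape I would like to describe with a schema.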
Using a schema like this "works", but the elements in the IDs field come back as null:
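from pyspark.sql.types import StructType, StructField, StringType

mySchema = StructType([
    StructField("Name", StringType(), True),
    StructField("Surname", StringType(), True),
    # Nested struct holding a single string field
    StructField("IDs", StructType([
        StructField("ID", StringType(), True)
    ]), True)
])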