Importing irregularly formatted json files

MikeJohnsonZa
New Contributor

Hi

I'm importing a large collection of JSON files. The problem is that they are not what I would expect a well-formatted JSON file to be (although they are probably still valid). Each file consists of a single record that looks something like this (this is just an abstraction):

[{"name":"myName","Surname":"MySurname"},[{"address":"1","Type":"Home"},{"address":"2","Type":"Home"}],[{"Tel":"1"},{"Tel":"2"}]]
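To make the shape of this record explicit, here is a small sketch that parses the sample with Python's standard json module and lists the type of each top-level element. The variable names are only illustrative.

```python
import json

# Sample record from the post: a top-level array holding one object and two arrays.
raw = (
    '[{"name":"myName","Surname":"MySurname"},'
    '[{"address":"1","Type":"Home"},{"address":"2","Type":"Home"}],'
    '[{"Tel":"1"},{"Tel":"2"}]]'
)

record = json.loads(raw)

# Collect the Python type name of each top-level element.
shape = [type(element).__name__ for element in record]
print(shape)  # ['dict', 'list', 'list']
```

So the top level is an array whose three elements have different types, which may be part of why a single struct schema is hard to write for it.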

I would ideally like to import it using the standard JSON read option, but I can't figure out how to structure the schema.

My first approach involved creating a UDF that imported the record as a string and returned a new, properly formatted object:

def structure(record):
  # record is the parsed top-level array: [person, addresses, telephones]
  result = {}
  result["name"] = record[0]["name"]
  result["Surname"] = record[0]["Surname"]
  result["addresses"] = []
  result["telephones"] = []
  for address in record[1]:
    result["addresses"].append({"address": address["address"], "Type": address["Type"]})
  for telephone in record[2]:
    result["telephones"].append({"Tel": telephone["Tel"]})
  return result

This works, but it is slower than the built-in JSON reader and less intuitive.
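One possible alternative to a UDF, sketched here on the assumption that the files can be preprocessed outside Spark, is to restructure each record in plain Python and write it out as one JSON object per line. The function name `restructure` and the output layout are illustrative, not part of the original code.

```python
import json

def restructure(record):
    """Turn [person, addresses, telephones] into a single nested dict."""
    person, addresses, telephones = record
    return {
        "name": person["name"],
        "Surname": person["Surname"],
        "addresses": [{"address": a["address"], "Type": a["Type"]} for a in addresses],
        "telephones": [{"Tel": t["Tel"]} for t in telephones],
    }

# Sample record from the post, as it would be read from one file.
raw = (
    '[{"name":"myName","Surname":"MySurname"},'
    '[{"address":"1","Type":"Home"},{"address":"2","Type":"Home"}],'
    '[{"Tel":"1"},{"Tel":"2"}]]'
)

# One JSON object on a single line, ready to append to a JSON Lines file.
line = json.dumps(restructure(json.loads(raw)))
print(line)
```

Each output line is then an ordinary JSON object with named fields, which should be easier to describe with a schema than the original top-level array.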

Using a schema like this "works", but the elements in IDs come back as null:

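from pyspark.sql.types import StructType, StructField, StringType

mySchema = StructType([
    StructField("Name", StringType(), True),
    StructField("Surname", StringType(), True),
    # Nested struct holding a single string field
    StructField("IDs", StructType([
        StructField("ID", StringType(), True)
    ]), True)
])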