How can I throw an exception when a .json.gz file has multiple roots?

amde99
New Contributor

I have a situation where source files in .json.gz sometimes arrive with invalid syntax containing multiple roots separated by empty braces []. How can I detect this and thrown an exception? Currently the code runs and picks up only record set 1, and skips all the others without throwing any kind of exception.

Example of a bad source file (sorry for bad formatting, kept getting an invalid HTML error):

[

//...record set 1

     "key": "value",
     "key": "value",
     "key": [
          "value"
      ],
      "key": "value",
      "key": "value",
      "key": "value"
},

][][

//... record set 2

][][][][

// ... record set 3

]

file read code:

 

 

df = spark.read \
    .format("json") \
    .load("<file path>.json.gz", multiLine=True)
print(df.count())