Data Engineering

How can I throw an exception when a .json.gz file has multiple roots?

amde99
New Contributor

I have a situation where source files in .json.gz sometimes arrive with invalid syntax: multiple root arrays separated by empty brackets []. How can I detect this and throw an exception? Currently the code runs, picks up only record set 1, and silently skips all the others without throwing any kind of exception.

Example of a bad source file (sorry for bad formatting, kept getting an invalid HTML error):

[
//... record set 1
{
     "key": "value",
     "key": "value",
     "key": [
          "value"
      ],
      "key": "value",
      "key": "value",
      "key": "value"
},
][][
//... record set 2
][][][][
//... record set 3
]

File read code:

df = spark.read \
    .format("json") \
    .load("<file path>.json.gz", multiLine=True)
print(df.count())

2 REPLIES

daniel_sahal
Esteemed Contributor

@amde99 
Changing the read mode to FAILFAST should make the read throw an exception when it hits malformed input.

 


https://spark.apache.org/docs/latest/sql-data-sources-json.html
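
For reference, a minimal sketch of what the FAILFAST suggestion could look like in PySpark. The helper name `read_json_failfast` is hypothetical, and `spark` is assumed to be an existing SparkSession. The thread does not confirm whether Spark treats content after the first root as malformed in multiLine mode, so this may not catch the multi-root case on its own.

```python
def read_json_failfast(spark, path):
    """Read a multi-line JSON file with mode=FAILFAST (hypothetical helper).

    `spark` is assumed to be an existing SparkSession. With FAILFAST,
    Spark raises an exception when it meets a record it considers
    malformed, instead of nulling or dropping it.
    """
    return (
        spark.read.format("json")
        .option("multiLine", True)   # one JSON document spans many lines
        .option("mode", "FAILFAST")  # default is PERMISSIVE
        .load(path)
    )
```

Usage would be along the lines of `df = read_json_failfast(spark, "<file path>.json.gz")` followed by an action such as `df.count()`, since a parse error may only surface once the data is actually read.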

Lakshay
Esteemed Contributor

Schema validation should help here.
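
If neither option catches the extra roots, one alternative is to check the file before handing it to Spark. Note this is a different technique from schema validation: a rough sketch that counts top-level JSON values using only the Python standard library. The function names `count_json_roots` and `assert_single_root` are hypothetical, and the sketch assumes the file is reachable through a local path and small enough to decompress into memory on the driver.

```python
import gzip
import json


def count_json_roots(text):
    """Count top-level JSON values in `text`.

    Raises json.JSONDecodeError if any part is not valid JSON.
    """
    decoder = json.JSONDecoder()
    idx, roots, n = 0, 0, len(text)
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while idx < n and text[idx].isspace():
            idx += 1
        if idx == n:
            return roots
        # Parse one complete value and move past it.
        _, idx = decoder.raw_decode(text, idx)
        roots += 1


def assert_single_root(path):
    """Raise ValueError unless the gzipped file holds exactly one JSON root."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        roots = count_json_roots(fh.read())
    if roots != 1:
        raise ValueError(f"{path}: expected 1 JSON root, found {roots}")
```

Calling `assert_single_root("<file path>.json.gz")` before `spark.read` would raise for a file shaped like the example in the question, because each `[...]` group, including the empty ones, counts as a separate root.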