How can I throw an exception when a .json.gz file has multiple roots?
04-11-2024 01:04 PM - edited 04-11-2024 01:05 PM
I have a situation where source files in .json.gz format sometimes arrive with invalid syntax: multiple root arrays separated by empty brackets []. How can I detect this and throw an exception? Currently the code runs, picks up only record set 1, and skips all the others without throwing any kind of exception.
Example of a bad source file (sorry for bad formatting, kept getting an invalid HTML error):
[
{
//...record set 1
"key": "value",
"key": "value",
"key": [
"value"
],
"key": "value",
"key": "value",
"key": "value"
},
][][
//... record set 2
][][][][
// ... record set 3
]
file read code:
df = spark.read \
.format("json") \
.load("<file path>.json.gz", multiLine=True)
print(df.count())
- Labels: Spark
04-19-2024 12:53 AM
@amde99
Setting the JSON reader's mode option to FAILFAST should make Spark throw an exception when it encounters malformed records.
https://spark.apache.org/docs/latest/sql-data-sources-json.html
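A minimal sketch of what that might look like, assuming an existing SparkSession is passed in. The helper name `read_json_strict` and the `READ_OPTIONS` dict are hypothetical, introduced only for illustration. Whether FAILFAST actually catches trailing roots when multiLine is enabled is not confirmed in this thread, so it is worth testing against a known-bad file first.

```python
# Reader options: keep multiLine from the original code and add FAILFAST,
# which asks Spark to raise on malformed records instead of tolerating them.
READ_OPTIONS = {"multiLine": True, "mode": "FAILFAST"}


def read_json_strict(spark, path):
    """Load a JSON file with FAILFAST mode (hypothetical helper).

    `spark` is assumed to be an existing SparkSession.
    """
    return (
        spark.read
        .format("json")
        .options(**READ_OPTIONS)
        .load(path)
    )
```

Usage would be along the lines of `df = read_json_strict(spark, "<file path>.json.gz")` followed by `df.count()`. If an exception is raised, it may surface either during the load or on the first action.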
04-19-2024 04:17 AM
Schema validation should help here.
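One way to validate a file before handing it to Spark is a plain-Python pre-check that decodes the first JSON root and fails if anything other than whitespace follows it. This is a different technique from Spark schema validation, offered only as a sketch: the function name `assert_single_json_root` is hypothetical, and it assumes the file is reachable through a local filesystem path and small enough to read into driver memory.

```python
import gzip
import json


def assert_single_json_root(path: str) -> None:
    """Raise ValueError if the gzipped file has content after its first JSON root."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        text = fh.read()

    # raw_decode does not skip leading whitespace, so find the first real character.
    start = len(text) - len(text.lstrip())

    # Decode exactly one JSON value; `end` is the offset just past it.
    # A syntactically broken first root raises json.JSONDecodeError,
    # which is itself a subclass of ValueError.
    _, end = json.JSONDecoder().raw_decode(text, start)

    # Anything left besides whitespace means there is more than one root.
    if text[end:].strip():
        raise ValueError(
            f"{path}: unexpected content after first JSON root at offset {end}"
        )
```

Calling `assert_single_json_root("<file path>.json.gz")` before `spark.read` would stop the job on a file like the example in the question, where `][][` follows the first array.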

