Databricks Community

amde99 · ‎04-11-2024

I have source files in .json.gz format that sometimes arrive with invalid syntax: multiple roots separated by empty brackets []. How can I detect this and throw an exception? Currently the code runs, picks up only record set 1, and silently skips all the others without throwing any kind of exception.

Example of a bad source file (sorry for bad formatting, kept getting an invalid HTML error):

[

//...record set 1

"key": "value",
"key": "value",
"key": [
"value"
],
"key": "value",
"key": "value",
"key": "value"
},

][][

//... record set 2

][][][][

// ... record set 3

]

file read code:

df = spark.read \
    .format("json") \
    .load("<file path>.json.gz", multiLine=True)
print(df.count())

daniel_sahal · ‎04-19-2024

@amde99
Setting the reader's mode option to FAILFAST should help here: in that mode Spark throws an exception when it meets a corrupted record, instead of the default PERMISSIVE behavior.

https://spark.apache.org/docs/latest/sql-data-sources-json.html
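As a minimal sketch of that suggestion, the read from the question can be wrapped with the mode option set. The helper name `read_json_failfast` is hypothetical, and `spark` is assumed to be the session that Databricks notebooks provide. Whether FAILFAST actually flags extra roots is not confirmed in this thread: if the parser stops after the first complete root, the read may still succeed, so it is worth testing against a known bad file.

```python
def read_json_failfast(spark, path):
    """Read a multi-line JSON file, asking Spark to raise on corrupted records.

    `spark` is an existing SparkSession; `path` is the .json.gz file path.
    This helper is illustrative and not part of any Spark API.
    """
    return (
        spark.read
        .format("json")
        .option("multiLine", True)   # treat the whole file as one JSON document
        .option("mode", "FAILFAST")  # the default mode is PERMISSIVE
        .load(path)
    )


# Usage in a notebook (an action such as count() triggers the actual parse):
# df = read_json_failfast(spark, "<file path>.json.gz")
# print(df.count())
```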

Lakshay · ‎04-19-2024

Schema validation should help here.
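A check that does not depend on Spark's parser is to count the top-level JSON values in the file before reading it. This is a different technique from the ones suggested in the replies, sketched here with the Python standard library only. The function names `count_json_roots` and `assert_single_root` are hypothetical, and the sketch assumes the file is reachable through a local path on the driver and is small enough to decompress into memory.

```python
import gzip
import json


def count_json_roots(path):
    """Return the number of top-level JSON values in a gzip-compressed file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        text = fh.read()

    decoder = json.JSONDecoder()
    pos, roots = 0, 0
    while True:
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # Parses one complete value starting at pos and returns where it ended.
        # Raises json.JSONDecodeError if the text at pos is not valid JSON.
        _, pos = decoder.raw_decode(text, pos)
        roots += 1
    return roots


def assert_single_root(path):
    """Raise ValueError unless the file contains exactly one JSON root."""
    roots = count_json_roots(path)
    if roots != 1:
        raise ValueError(f"{path}: expected 1 JSON root, found {roots}")
```

Calling `assert_single_root` on each incoming file before `spark.read` would stop the job on a file shaped like `[...][][...]`, since each empty `[]` also counts as a root. Files with other syntax errors fail earlier with `json.JSONDecodeError`.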


How can I throw an exception when a .json.gz file has multiple roots?
