Data Engineering

How can I throw an exception when a .json.gz file has multiple roots?

amde99
New Contributor

I have a situation where .json.gz source files sometimes arrive with invalid syntax: multiple roots separated by empty brackets []. How can I detect this and throw an exception? Currently the code runs, picks up only record set 1, and silently skips all the others without throwing any kind of exception.

Example of a bad source file (sorry for bad formatting, kept getting an invalid HTML error):

[

//...record set 1

     "key": "value",
     "key": "value",
     "key": [
          "value"
      ],
      "key": "value",
      "key": "value",
      "key": "value"
},

][][

//... record set 2

][][][][

// ... record set 3

]

File read code:


df = spark.read \
    .format("json") \
    .load("<file path>.json.gz", multiLine=True)
print(df.count())

2 REPLIES

daniel_sahal
Esteemed Contributor

@amde99 
Changing the mode to FAILFAST should be able to help you with throwing an exception.

 


https://spark.apache.org/docs/latest/sql-data-sources-json.html
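
For reference, a minimal sketch of the read from the question with the mode set to FAILFAST. The helper name `read_json_failfast` is made up for illustration; `multiLine` and `mode` are the JSON reader options described on the page linked above. Whether FAILFAST actually flags extra roots that follow a valid first document is not confirmed in this thread, so it is worth trying against a known-bad file before relying on it.

```python
def read_json_failfast(spark, path):
    """Read a multi-line JSON file (gzip is handled by the reader) in FAILFAST mode.

    `spark` is an existing SparkSession; `path` is the file to load.
    This helper name is hypothetical and only wraps standard reader options.
    """
    return (
        spark.read
        .format("json")
        .option("multiLine", True)    # whole file is one JSON document
        .option("mode", "FAILFAST")   # raise on malformed input instead of dropping it
        .load(path)
    )
```

Usage would look like `df = read_json_failfast(spark, "<file path>.json.gz")` followed by `df.count()`. Depending on the Spark version and on whether a schema is supplied, the error may surface at load time or only once an action runs.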

Lakshay
Esteemed Contributor

Schema validation should help here.
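
If the reader options alone do not catch the extra roots, one possible approach is a structural check that runs before the file is handed to Spark. The sketch below uses only the Python standard library: it parses the first JSON value and raises if anything other than whitespace follows it. The names `MultipleJsonRootsError` and `assert_single_json_root` are made up for this example. It also assumes the file is reachable through a local path on the driver and is small enough to read into memory.

```python
import gzip
import json


class MultipleJsonRootsError(ValueError):
    """Raised when a file holds content after its first top-level JSON value."""


def assert_single_json_root(path, encoding="utf-8"):
    """Raise MultipleJsonRootsError unless the gzipped file has exactly one JSON root."""
    # Assumption: the decompressed file fits in driver memory.
    with gzip.open(path, mode="rt", encoding=encoding) as handle:
        text = handle.read()

    # raw_decode does not skip leading whitespace, so start past it.
    start = len(text) - len(text.lstrip())

    # raw_decode parses one JSON value and returns the index where it stopped.
    # A malformed first root raises json.JSONDecodeError (also a ValueError).
    _, end = json.JSONDecoder().raw_decode(text, start)

    trailing = text[end:].strip()
    if trailing:
        raise MultipleJsonRootsError(
            f"{path}: unexpected content after the first JSON root "
            f"at offset {end}: {trailing[:40]!r}"
        )
```

Calling `assert_single_json_root("<file path>.json.gz")` just before `spark.read` would stop the job on a file shaped like `[...][][...]`, instead of letting it load record set 1 only.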
