
Read JSON files from the S3 bucket

Orianh
Valued Contributor II

Hello guys, I'm trying to read JSON files from the S3 bucket, but no matter what I try I either get "Query returned no results" or, if I don't specify the schema, "unable to infer a schema".

I also tried to mount the S3 bucket, but that still doesn't work.

Here is some code that I tried:

df = spark.read.json('dbfs:/mnt/path_to_json', multiLine=True, schema=json_schema)

df = spark.read.option('multiline', 'true').format('json').load(path_to_json)

df = spark.read.json('s3a://path_to_json', multiLine=True)

display(df)

The JSON file looks like this:

{
  'key1' : 'value1',
  'key2' : 'value2',
  ...
}

Hope you guys can help me,

Thanks!

**EDIT**: Inside the JSON I have a string value that contains "\", which throws a corrupted-record error. Is there any way to overcome this without changing the value for that specific key?

1 ACCEPTED SOLUTION


Orianh
Valued Contributor II

I think I found the problem: inside the JSON I have a string value that contains '\', and it throws a corrupted-record error. Any idea how to overcome this without changing all the JSON files?


11 REPLIES

Prabakar
Esteemed Contributor III

Please try the below code and let me know if it helps you.

%scala
val mdf = spark.read.option("multiline", "true").json("s3://<path-to-jsonfile>/sample.json")
mdf.show(false)

Orianh
Valued Contributor II

Thanks for your answer. I still get an "unable to infer a schema" error.

Error:

org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It must be specified manually.

I tried both s3:// and s3a:// -- neither worked.
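
For reference, specifying the schema manually looks roughly like this (a sketch; the field names are placeholders taken from the sample object in the question, and the real names and types would come from the actual files):

from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema for the flat key/value object shown above.
json_schema = StructType([
    StructField("key1", StringType(), True),
    StructField("key2", StringType(), True),
])

df = spark.read.json("s3a://path_to_json", multiLine=True, schema=json_schema)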

Hubert-Dudek
Esteemed Contributor III

Please verify the JSON in some online JSON validator. Try double quotes in the JSON (I had an issue with single quotes once).

Your code examples are correct.
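
As a quick local sanity check (just a sketch, not from the thread), Python's json module only accepts double-quoted strings, so the single-quoted form of the sample object would not parse:

import json

# Double-quoted keys and values parse fine.
json.loads('{"key1": "value1", "key2": "value2"}')
# The single-quoted variant raises json.JSONDecodeError:
# json.loads("{'key1': 'value1', 'key2': 'value2'}")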

Orianh
Valued Contributor II

The JSON is valid. When I tried to write a JSON file to DBFS and then read it, everything went fine.

dbutils.fs.put("/tmp/test.json", """
{"string":"string1",
"int":1,
"array":[1,2,3],
"dict": {"key": "value1"}}
""", True)
 
df = spark.read.json('/tmp/test.json')

But when I tried to read from the S3 bucket, or from the mount, it failed.

Hubert-Dudek
Esteemed Contributor III

Other ideas:

  • validate the location and file existence, for example using "Data" in the left menu in Databricks,
  • validate the S3 access rights (the AWS admin attaches a policy to the user/role; maybe something is missing),
  • try reading it as a text file with spark.read.text() to check whether the content loads at all (see the sketch below).
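
A minimal sketch of that last check (the path is a placeholder); if this returns rows, access and the path are fine and the problem is in the JSON parsing rather than in the S3 permissions:

# Read the raw file as plain text just to confirm the bytes are reachable.
raw = spark.read.text("s3a://path_to_json")
raw.show(5, truncate=False)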

Orianh
Valued Contributor II

I wrote the real JSON to /tmp/test.json and tried to read it now.

When I didn't define the schema, I got an error:

Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().

But when I defined the schema, I got a DataFrame with all columns null.

I do have access to the S3 bucket, since I already read text files from there, and the JSON files have data inside them (~800 KB).
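
For reference, the inspection that the error message suggests looks roughly like this in PySpark (a sketch; the path and the key1/key2 fields are placeholders):

from pyspark.sql.types import StructType, StructField, StringType

# The user-supplied schema must include the corrupt-record column for Spark to populate it.
debug_schema = StructType([
    StructField("key1", StringType(), True),
    StructField("key2", StringType(), True),
    StructField("_corrupt_record", StringType(), True),
])

# Cache the parsed result first (as the error message asks), then look at what failed to parse.
df = spark.read.schema(debug_schema).json("/tmp/test.json").cache()
df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)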

Thanks a lot for your help

Prabakar
Esteemed Contributor III

Please refer to the documentation on reading JSON files.

If you are getting this error, the problem is likely with the JSON schema. Please validate it.

As a test, create a simple JSON file (you can find one on the internet), upload it to your S3 bucket, and try to read that. If it works, then your JSON file's schema has to be checked.

Further, the methods that you tried should also work if the JSON format is valid.
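
A minimal version of that test, assuming the bucket is mounted (the path and content are placeholders):

# Write a known-good JSON file into the mounted bucket, then read it back.
dbutils.fs.put("/mnt/path_to_bucket/tmp/simple.json", '{"a": 1, "b": "text"}', True)
spark.read.json("/mnt/path_to_bucket/tmp/simple.json").show()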

Orianh
Valued Contributor II

I think I found the problem: inside the JSON I have a string value that contains '\', and it throws a corrupted-record error. Any idea how to overcome this without changing all the JSON files?

Hubert-Dudek
Esteemed Contributor III

Try experimenting with these options:

df = spark.read \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .json(path_to_json)

Orianh
Valued Contributor II

Still not working -- same corrupted-record error. I uploaded the same JSON to the S3 bucket, just without the problematic value, and everything went fine.

Hubert-Dudek
Esteemed Contributor III

So the last resort is to just replace the '\' in the files, as you do. You can do that programmatically before reading the JSON.
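
A rough sketch of that pre-processing step (the path is a placeholder, and the replace is naive: it assumes the only backslashes in the files are the unescaped ones causing the error):

# Load each JSON file as one whole string, escape stray backslashes, then parse the cleaned strings.
raw = spark.sparkContext.wholeTextFiles("s3a://path_to_json/*.json")   # (filename, content) pairs
fixed = raw.values().map(lambda s: s.replace("\\", "\\\\"))            # turn \ into \\ so the JSON parser accepts it
df = spark.read.json(fixed)
df.show(truncate=False)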
