<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Read JSON files from the s3 bucket in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13489#M8162</link>
    <description>&lt;P&gt;Other ideas:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;validate the location and file existence, for example via "Data" in the left-hand menu in Databricks,&lt;/LI&gt;&lt;LI&gt;validate the S3 access rights (have your AWS admin check the policy attached to the user/role; maybe something is missing),&lt;/LI&gt;&lt;LI&gt;try reading it as a text file to check whether the content loads at all:&lt;/LI&gt;&lt;/UL&gt;&lt;PRE&gt;&lt;CODE&gt;spark.read.text(path_to_json)&lt;/CODE&gt;&lt;/PRE&gt;</description>
    <pubDate>Thu, 14 Oct 2021 10:55:02 GMT</pubDate>
    <dc:creator>Hubert-Dudek</dc:creator>
    <dc:date>2021-10-14T10:55:02Z</dc:date>
    <item>
      <title>Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13483#M8156</link>
      <description>&lt;P&gt;Hello guys, I'm trying to read JSON files from an S3 bucket, but no matter what I try I get "Query returned no result", or, if I don't specify a schema, "unable to infer a schema".&lt;/P&gt;&lt;P&gt;I also tried mounting the S3 bucket; that still doesn't work.&lt;/P&gt;&lt;P&gt;Here is some code that I tried:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df = spark.read.json('dbfs:/mnt/path_to_json', multiLine=True, schema=json_schema)
&amp;nbsp;
df = spark.read.option('multiline','true').format('json').load(path_to_json)
&amp;nbsp;
df = spark.read.json('s3a://path_to_json', multiLine=True)
&amp;nbsp;
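# Hedged sketch (an addition, not from the original post): a bare backslash is
# not a legal JSON escape, which is why such a record gets flagged as corrupt.
# Doubling the backslashes before parsing yields valid JSON again:
import json
bad = r'{"key1": "a\windows\path"}'                 # \w is not a valid JSON escape
try:
    json.loads(bad)                                 # raises json.JSONDecodeError
    parsed = None
except json.JSONDecodeError:
    parsed = json.loads(bad.replace("\\", "\\\\"))  # escape backslashes, then parse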
display(df)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;The JSON file looks like this:&lt;/P&gt;&lt;P&gt;{&lt;/P&gt;&lt;P&gt;'key1' : 'value1',&lt;/P&gt;&lt;P&gt;'key2' : 'value2',&lt;/P&gt;&lt;P&gt;...&lt;/P&gt;&lt;P&gt;}&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Hope you guys can help me,&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;B&gt;**EDIT**: &lt;/B&gt;inside the JSON I have a string value that contains " \ ", which throws a corrupted-record error. Is there any way to overcome this without changing the value for that specific key?&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 08:59:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13483#M8156</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2021-10-14T08:59:31Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13484#M8157</link>
      <description>&lt;P&gt;Please try the code below and let me know if it helps.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;%scala
val mdf = spark.read.option("multiline", "true").json("s3://&amp;lt;path-to-jsonfile&amp;gt;/sample.json")
mdf.show(false)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 10:26:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13484#M8157</guid>
      <dc:creator>Prabakar</dc:creator>
      <dc:date>2021-10-14T10:26:25Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13485#M8158</link>
      <description>&lt;P&gt;Thanks for your answer. I still get the unable-to-infer-a-schema error.&lt;/P&gt;&lt;P&gt;Error:&lt;/P&gt;&lt;P&gt;org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It must be specified manually.&lt;/P&gt;&lt;P&gt;Tried s3:// and s3a:// -- neither worked.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 10:31:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13485#M8158</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2021-10-14T10:31:10Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13486#M8159</link>
      <description>&lt;P&gt;Please verify the JSON in an online JSON validator. Try double quotes in the JSON -- I once had an issue with single quotes.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Your code examples are correct.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 10:34:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13486#M8159</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-10-14T10:34:40Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13487#M8160</link>
      <description>&lt;P&gt;Please refer to the &lt;A href="https://docs.databricks.com/data/data-sources/read-json.html#json-file" alt="https://docs.databricks.com/data/data-sources/read-json.html#json-file" target="_blank"&gt;doc&lt;/A&gt; on reading JSON files.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If you are getting this error, the problem is likely with the JSON schema. Please validate it.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;As a test, create a simple JSON file (you can get one on the internet), upload it to your S3 bucket, and try to read that. If it works, then your JSON file's schema has to be checked.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Further, the methods that you tried should also work if the JSON format is valid.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 10:42:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13487#M8160</guid>
      <dc:creator>Prabakar</dc:creator>
      <dc:date>2021-10-14T10:42:37Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13488#M8161</link>
      <description>&lt;P&gt;The JSON is valid. When I wrote a JSON file to DBFS and then read it back, everything went fine:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;dbutils.fs.put("/tmp/test.json", """
{"string":"string1",
"int":1,
"array":[1,2,3],
"dict": {"key": "value1"}}
""", True)
&amp;nbsp;
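# Hedged sketch (an addition, not from the original post): the same validity
# check can be done with Python's json module; note that single-quoted JSON,
# as in the original question's example, is rejected outright:
import json
valid = json.loads('{"string": "string1", "int": 1, "array": [1, 2, 3]}')
try:
    json.loads("{'key1': 'value1'}")            # single quotes are invalid JSON
    single_quotes_ok = True
except json.JSONDecodeError:
    single_quotes_ok = False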
df = spark.read.json('/tmp/test.json')&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;but when I tried to read from the S3 bucket, or from the mount, it failed.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 10:44:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13488#M8161</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2021-10-14T10:44:13Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13489#M8162</link>
      <description>&lt;P&gt;Other ideas:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;validate the location and file existence, for example via "Data" in the left-hand menu in Databricks,&lt;/LI&gt;&lt;LI&gt;validate the S3 access rights (have your AWS admin check the policy attached to the user/role; maybe something is missing),&lt;/LI&gt;&lt;LI&gt;try reading it as a text file to check whether the content loads at all:&lt;/LI&gt;&lt;/UL&gt;&lt;PRE&gt;&lt;CODE&gt;spark.read.text(path_to_json)&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Thu, 14 Oct 2021 10:55:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13489#M8162</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-10-14T10:55:02Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13490#M8163</link>
      <description>&lt;P&gt;I wrote the real JSON into /tmp/test.json and tried to read it.&lt;/P&gt;&lt;P&gt;When I didn't define the schema I got this error:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the&lt;/P&gt;&lt;P&gt;referenced columns only include the internal corrupt record column&lt;/P&gt;&lt;P&gt;(named _corrupt_record by default). For example:&lt;/P&gt;&lt;P&gt;spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()&lt;/P&gt;&lt;P&gt;and spark.read.schema(schema).json(file).select("_corrupt_record").show().&lt;/P&gt;&lt;P&gt;Instead, you can cache or save the parsed results and then send the same query.&lt;/P&gt;&lt;P&gt;For example, val df = spark.read.schema(schema).json(file).cache() and then&lt;/P&gt;&lt;P&gt;df.filter($"_corrupt_record".isNotNull).count().;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;But when I defined the schema, I got a DataFrame with all columns null.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I do have access to the S3 bucket, since I have already read text files from there, and the JSON files do have data in them (about 800 KB).&lt;/P&gt;&lt;P&gt;Thanks a lot for your help&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 11:24:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13490#M8163</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2021-10-14T11:24:00Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13491#M8164</link>
      <description>&lt;P&gt;I think I found the problem: inside the JSON I have a string value that contains '\', and it throws a corrupted-record error. Any idea how to overcome this without changing all the JSON files?&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 11:51:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13491#M8164</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2021-10-14T11:51:59Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13492#M8165</link>
      <description>&lt;P&gt;Try experimenting with these options:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df = spark.read\
.option("mode", "PERMISSIVE")\
.option("columnNameOfCorruptRecord", "_corrupt_record")\
.json(...)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 11:59:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13492#M8165</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-10-14T11:59:55Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13493#M8166</link>
      <description>&lt;P&gt;Still not working -- same corrupted-record error. I uploaded the same JSON to the S3 bucket, just without the problematic value, and everything went well.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 12:36:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13493#M8166</guid>
      <dc:creator>Orianh</dc:creator>
      <dc:date>2021-10-14T12:36:50Z</dc:date>
    </item>
    <item>
      <title>Re: Read JSON files from the s3 bucket</title>
      <link>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13494#M8167</link>
      <description>&lt;P&gt;So the last resort is just to replace '\' in the files, as you do. You can do that programmatically before reading the JSON.&lt;/P&gt;</description>
      <pubDate>Thu, 14 Oct 2021 13:20:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-json-files-from-the-s3-bucket/m-p/13494#M8167</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-10-14T13:20:53Z</dc:date>
    </item>
  </channel>
</rss>

