
Read JSON files from the S3 bucket

Orianh
Valued Contributor II

Hello guys, I'm trying to read JSON files from the S3 bucket, but no matter what I try I either get "Query returned no results" or, if I don't specify the schema, "unable to infer a schema".

I also tried to mount the S3 bucket, but that still doesn't work.

Here is some code that I tried:

df = spark.read.json('dbfs:/mnt/path_to_json', multiLine=True, schema=json_schema)

df = spark.read.option('multiline', 'true').format('json').load(path_to_json)

df = spark.read.json('s3a://path_to_json', multiLine=True)

display(df)

The JSON file looks like this:

{
  'key1' : 'value1',
  'key2' : 'value2',
  ...
}

Hope you guys can help me,

Thanks!

**EDIT**: Inside the JSON I have a string value that contains "\", which throws a corrupted-record error. Is there any way to overcome this without changing the value for that specific key?

1 ACCEPTED SOLUTION


Orianh
Valued Contributor II

I think I found the problem: inside the JSON I have a string value that contains '\', and it throws a corrupted-record error. Any idea how to overcome this without changing all the JSON files?


11 REPLIES

Prabakar
Esteemed Contributor III

Please try the below code and let me know if it helps you.

%scala
val mdf = spark.read.option("multiline", "true").json("s3://<path-to-jsonfile>/sample.json")
mdf.show(false)

Orianh
Valued Contributor II

Thanks for your answer. I still get an "unable to infer a schema" error.

Error:

org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It must be specified manually.

I tried both s3:// and s3a:// -- neither worked.
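
For reference, specifying the schema manually looks roughly like this (a sketch; the field names are placeholders taken from the sample object in the question, and the real names and types would come from the actual files):

from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema for the flat key/value object shown above.
json_schema = StructType([
    StructField("key1", StringType(), True),
    StructField("key2", StringType(), True),
])

df = spark.read.json("s3a://path_to_json", multiLine=True, schema=json_schema)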

Hubert-Dudek
Esteemed Contributor III

Please verify the JSON in some online JSON validator. Try double quotes in the JSON (I had an issue with single quotes once).

Your code examples are correct.
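
As a quick local sanity check (just a sketch, not from the thread), Python's json module only accepts double-quoted strings, so the single-quoted form of the sample object would not parse:

import json

# Double-quoted keys and values parse fine.
json.loads('{"key1": "value1", "key2": "value2"}')
# The single-quoted variant raises json.JSONDecodeError:
# json.loads("{'key1': 'value1', 'key2': 'value2'}")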

Orianh
Valued Contributor II

The JSON is valid. When I tried to write a JSON file to DBFS and then read it, everything went fine.

dbutils.fs.put("/tmp/test.json", """
{"string":"string1",
"int":1,
"array":[1,2,3],
"dict": {"key": "value1"}}
""", True)
 
df = spark.read.json('/tmp/test.json')

But when I tried to read from the S3 bucket, or from the mount, it failed.

Hubert-Dudek
Esteemed Contributor III

Other ideas:

  • validate the location and file existence, for example using "Data" in the left menu in Databricks,
  • validate the S3 access rights (the AWS admin attaches a policy to the user/role; maybe something is missing),
  • try reading it as a text file with spark.read.text() to check whether the content loads at all (see the sketch below).
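
A minimal sketch of that last check (the path is a placeholder); if this returns rows, access and the path are fine and the problem is in the JSON parsing rather than in the S3 permissions:

# Read the raw file as plain text just to confirm the bytes are reachable.
raw = spark.read.text("s3a://path_to_json")
raw.show(5, truncate=False)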

Orianh
Valued Contributor II

I wrote the real JSON to /tmp/test.json and tried to read it now.

When I didn't define the schema, I got an error:

Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named _corrupt_record by default). For example:
spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()
and spark.read.schema(schema).json(file).select("_corrupt_record").show().
Instead, you can cache or save the parsed results and then send the same query.
For example, val df = spark.read.schema(schema).json(file).cache() and then
df.filter($"_corrupt_record".isNotNull).count().

But when I defined the schema, I got a DataFrame with all columns null.

I do have access to the S3 bucket, since I already read text files from there, and the JSON files have data inside them (~800 KB).
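
For reference, the inspection that the error message suggests looks roughly like this in PySpark (a sketch; the path and the key1/key2 fields are placeholders):

from pyspark.sql.types import StructType, StructField, StringType

# The user-supplied schema must include the corrupt-record column for Spark to populate it.
debug_schema = StructType([
    StructField("key1", StringType(), True),
    StructField("key2", StringType(), True),
    StructField("_corrupt_record", StringType(), True),
])

# Cache the parsed result first (as the error message asks), then look at what failed to parse.
df = spark.read.schema(debug_schema).json("/tmp/test.json").cache()
df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)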

Thanks a lot for your help

Prabakar
Esteemed Contributor III

Please refer to the documentation on reading JSON files.

If you are getting this error, the problem is likely with the JSON schema. Please validate it.

As a test, create a simple JSON file (you can find one on the internet), upload it to your S3 bucket, and try to read that. If it works, then your JSON file's schema has to be checked.

Further, the methods that you tried should also work if the JSON format is valid.
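
A minimal version of that test, assuming the bucket is mounted (the path and content are placeholders):

# Write a known-good JSON file into the mounted bucket, then read it back.
dbutils.fs.put("/mnt/path_to_bucket/tmp/simple.json", '{"a": 1, "b": "text"}', True)
spark.read.json("/mnt/path_to_bucket/tmp/simple.json").show()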

Orianh
Valued Contributor II

I think I found the problem: inside the JSON I have a string value that contains '\', and it throws a corrupted-record error. Any idea how to overcome this without changing all the JSON files?

Hubert-Dudek
Esteemed Contributor III

Try experimenting with these options:

df = spark.read \
    .option("mode", "PERMISSIVE") \
    .option("columnNameOfCorruptRecord", "_corrupt_record") \
    .json(path_to_json)

Orianh
Valued Contributor II

Still not working -- same corrupted-record error. I uploaded the same JSON to the S3 bucket, just without the problematic value, and everything went fine.

Hubert-Dudek
Esteemed Contributor III

So the last resort is to just replace the '\' in the files, as you do. You can do that programmatically before reading the JSON.
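
A rough sketch of that pre-processing step (the path is a placeholder, and the replace is naive: it assumes the only backslashes in the files are the unescaped ones causing the error):

# Load each JSON file as one whole string, escape stray backslashes, then parse the cleaned strings.
raw = spark.sparkContext.wholeTextFiles("s3a://path_to_json/*.json")   # (filename, content) pairs
fixed = raw.values().map(lambda s: s.replace("\\", "\\\\"))            # turn \ into \\ so the JSON parser accepts it
df = spark.read.json(fixed)
df.show(truncate=False)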
