Read JSON with backslash.

Orianh
Valued Contributor II

Hello guys.

I'm trying to read a JSON file that contains backslashes, and PySpark fails to read it.

I've tried a lot of options but haven't solved it yet. I thought about reading the whole JSON as text and replacing every "\" with "/", but PySpark fails to read it as text too.

Example JSON:

    {
      "fname": "max",
      "lname": "tom",
      "path": "c\\dir1\\dir2"
    }

Code that I tried:

    # attempt 1: read as JSON, keeping corrupt records in a separate column
    df = spark.read.option('mode', 'PERMISSIVE').option('columnNameOfCorruptRecord', '_corrupt_record').json('path_to_json', multiLine=True)

    # attempt 2: read the file as plain text
    df = spark.read.text('path_to_json')

With the first snippet, when I don't specify a schema I get an "Unable to infer schema" error, and when I do specify one the query returns no results.

With the second snippet the query also returns no results.

The file does contain the JSON data, but because of the path field PySpark fails to read it as valid JSON.

(If there is a way to drop the path field while reading the JSON, I don't mind doing that, but I didn't find any information on how to achieve it.)
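Roughly what I had in mind for the text-and-replace approach is the sketch below (untested, since even the plain text read returns nothing for me; it assumes each file fits in a single row and that forward slashes are an acceptable substitute):

    from pyspark.sql import functions as F

    # wholetext=True reads each file as one row, so multi-line JSON stays intact
    raw = spark.read.text('path_to_json', wholetext=True)

    # in Java regex, r'\\' matches a single literal backslash
    fixed = raw.select(F.regexp_replace('value', r'\\', '/').alias('value'))

    # spark.read.json also accepts an RDD of JSON strings
    df = spark.read.json(fixed.rdd.map(lambda r: r.value))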

Hope someone can help me out.

Thanks!


7 REPLIES

Kaniz
Community Manager

Hi @Orianh! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer first; otherwise I will get back to you soon. Thanks.

Kaniz
Community Manager

Hi @Orianh,

This worked for me.

    %python
    ds = spark.read.option("multiline", "true").json("path_to_the_json_file")
    ds.printSchema()
    ds.show()

Please let me know if this helps.
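Note that this parses the sample above because "\\" is a valid JSON escape for a single backslash. If the real files contain lone backslashes (for example "c\dir1\dir2"), the JSON reader also has an allowBackslashEscapingAnyCharacter option that tells the underlying parser to accept any backslash-escaped character; be aware it treats "\d" as a plain "d", so the backslash itself is dropped. A sketch, not verified against this thread's data:

    %python
    ds = (spark.read
          .option("multiline", "true")
          .option("allowBackslashEscapingAnyCharacter", "true")
          .json("path_to_the_json_file"))
    ds.show()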

jose_gonzalez
Moderator

Hi @orian hindi,

Please let us know if @Kaniz Fatma's solution worked for you, and if so, mark it as the best answer. If not, please provide more details and we will help you solve your error.

Orianh
Valued Contributor II

Thanks for the response, I managed to solve this by myself 😀

Kaniz
Community Manager

Awesome, thanks.

Anonymous
Not applicable

@orian hindi - Would you be happy to post the solution you came up with and then mark it as best? That will help other members. 😎

Orianh
Valued Contributor II
(Accepted Solution)

I did it with boto3 instead of PySpark, since it's not a lot of files.

    import json
    import re

    import boto3
    import pandas as pd
    from pandas import json_normalize

    # JARVIS_BUCKET, prefix and ANCHOR_PATTERN are defined earlier in the notebook
    jsons_data = []
    client = boto3.client('s3')
    s3_resource = boto3.resource('s3')
    bucket = s3_resource.Bucket(JARVIS_BUCKET)
    for obj in bucket.objects.filter(Prefix=prefix):
        file_name = obj.key
        if re.search(ANCHOR_PATTERN, file_name):
            json_obj = client.get_object(Bucket=JARVIS_BUCKET, Key=file_name)
            body = json_obj['Body']
            json_string = body.read().decode('utf-8')
            # strict=False lets json.loads accept raw control characters inside strings
            jsons_data.append(json_normalize(json.loads(json_string, strict=False)))

    df = pd.concat(jsons_data)
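If a Spark DataFrame is needed afterwards, the pandas result can be handed back to Spark (a one-line sketch, assuming the notebook's spark session):

    # convert the concatenated pandas DataFrame into a Spark DataFrame
    spark_df = spark.createDataFrame(df)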
