10-17-2021 04:55 AM
Hello guys,
I'm trying to read a JSON file that contains backslashes, and PySpark fails to read it.
I've tried a lot of options but haven't solved it yet. I thought about reading the whole JSON as text and replacing every "\" with "/", but PySpark fails to read it as text too.
Example JSON:
{
  "fname": "max",
  "lname": "tom",
  "path": "c\\dir1\\dir2"
}
Code I tried:
df = spark.read.option('mode','PERMISSIVE').option('columnNameOfCorruptRecord', '_corrupt_record').json('path_to_json', multiLine=True)
df = spark.read.text('path_to_json')
With the first code example, when I don't specify a schema I get an "Unable to infer schema" error, and if I do specify one I get "Query returned no result".
With the second code example I also get "Query returned no result".
The file contains the JSON data, but because of the backslashes in the path field PySpark fails to read it as valid JSON.
(If there is a way to drop the path field while reading the JSON I don't mind doing that, but I didn't find any information on how to achieve it.)
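For reference, the text-and-replace approach I had in mind looks roughly like this (just a sketch, not working code; wholetext=True assumes each file holds a single JSON document, and 'path_to_json' is the same placeholder as above):
from pyspark.sql import functions as F
# Read each file as one whole-text row, swap the backslashes for
# forward slashes, then parse the cleaned strings as JSON.
raw = spark.read.text('path_to_json', wholetext=True)
cleaned = raw.select(F.regexp_replace('value', r'\\', '/').alias('value'))
df = spark.read.json(cleaned.rdd.map(lambda r: r.value))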
Hope someone can help me out.
Thanks!
Accepted Solutions
11-15-2021 12:58 AM
I did it with boto3 instead of PySpark, since it's not a lot of files.
import json
import re
import boto3
import pandas as pd
from pandas import json_normalize

# JARVIS_BUCKET, prefix and ANCHOR_PATTERN are my own constants, defined elsewhere.
jsons_data = []
client = boto3.client('s3')
s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket(JARVIS_BUCKET)
for obj in bucket.objects.filter(Prefix=prefix):
    file_name = obj.key
    if re.search(ANCHOR_PATTERN, file_name):
        json_obj = client.get_object(Bucket=JARVIS_BUCKET, Key=file_name)
        body = json_obj['Body']
        json_string = body.read().decode('utf-8')
        # strict=False lets json.loads accept control characters
        # inside the string values.
        jsons_data.append(json_normalize(json.loads(json_string, strict=False)))
df = pd.concat(jsons_data)
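If the result needs to go back into Spark afterwards, the pandas frame converts directly (assuming an active SparkSession named spark):
spark_df = spark.createDataFrame(df)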
10-21-2021 09:58 AM
Hi @orian hindi,
Please let us know if @Kaniz Fatma's solution worked for you, and if so, select it as the best answer. If not, please provide more details and we will help you solve your error.
11-11-2021 04:25 AM
Thanks for the response. I managed to solve it by myself 😀
11-11-2021 08:48 AM
@orian hindi - Would you be happy to post the solution you came up with and then mark it as best? That will help other members. 😎