Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Read JSON with backslashes

Orianh
Valued Contributor II

Hello guys,

I'm trying to read a JSON file that contains backslashes, and PySpark fails to read it.

I've tried a lot of options but haven't solved it yet. I also thought about reading the whole JSON as text and replacing every "\" with "/", but PySpark fails to read it as text too.

Example JSON:

    {
        "fname": "max",
        "lname": "tom",
        "path": "c\\dir1\\dir2"
    }
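(As I understand it, a lone backslash isn't valid JSON unless it starts an escape sequence, so a doubled \\ parses while a single one does not; a quick check in plain Python:)

    import json

    json.loads('{"path": "c\\\\dir1\\\\dir2"}')   # OK: \\ in the JSON text is an escaped backslash
    # json.loads('{"path": "c\\dir1\\dir2"}')     # raises JSONDecodeError: \d is not a valid escape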

Code that I tried:

    df = spark.read.option('mode', 'PERMISSIVE').option('columnNameOfCorruptRecord', '_corrupt_record').json('path_to_json', multiLine=True)

    df = spark.read.text('path_to_json')

With the first snippet, when I don't specify a schema I get an "Unable to infer schema" error, and when I do specify one the query returns no results.

With the second snippet the query also returns no results.

The file does contain JSON data, but because of the path field PySpark fails to read it as valid JSON.

(If there is a way to drop the path field while reading the JSON, I don't mind doing that, but I didn't find any information on how to achieve it.)
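For reference, here is a minimal sketch of the text-then-replace idea using wholeTextFiles ('path_to_json' is a placeholder again; note this would also rewrite legitimate escapes like \"):

    # Sketch of the text-then-replace workaround: read each file as one raw
    # string, swap backslashes for forward slashes, then parse the strings.
    raw = spark.sparkContext.wholeTextFiles('path_to_json')   # RDD of (filename, content) pairs
    sanitized = raw.map(lambda kv: kv[1].replace('\\', '/'))  # '\\' is one backslash in Python
    df = spark.read.json(sanitized)                           # read.json also accepts an RDD of strings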

Hope someone can help me out.

Thanks!


4 REPLIES

jose_gonzalez
Databricks Employee

Hi @orian hindi,

Please let us know if @Kaniz Fatma's solution worked for you and, if so, select it as the best answer. If not, please provide more details and we will help you solve the error.

Orianh
Valued Contributor II

Thanks for the response, I managed to solve this by myself 😀

Anonymous
Not applicable

@orian hindi - Would you be happy to post the solution you came up with and then mark it as best? That will help other members. 😎

Orianh
Valued Contributor II

I did it with boto3 instead of PySpark, since it's not a lot of files.

    import json
    import re

    import boto3
    import pandas as pd
    from pandas import json_normalize

    # Download every matching JSON file from the bucket and parse it in plain
    # Python; json.loads(strict=False) is more forgiving than Spark's JSON reader.
    jsons_data = []
    client = boto3.client('s3')
    s3_resource = boto3.resource('s3')
    bucket = s3_resource.Bucket(JARVIS_BUCKET)
    for obj in bucket.objects.filter(Prefix=prefix):
        file_name = obj.key
        if re.search(ANCHOR_PATTERN, file_name):
            json_obj = client.get_object(Bucket=JARVIS_BUCKET, Key=file_name)
            body = json_obj['Body']
            json_string = body.read().decode('utf-8')
            jsons_data.append(json_normalize(json.loads(json_string, strict=False)))

    df = pd.concat(jsons_data)
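For anyone curious, strict=False is what makes this tolerant: it lets json.loads accept control characters (such as literal newlines) inside string values that the strict parser rejects. A tiny illustration:

    import json

    raw = '{"note": "line1\nline2"}'          # a literal newline inside a string value
    # json.loads(raw) raises "Invalid control character"
    parsed = json.loads(raw, strict=False)    # accepted when strict=False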
