10-17-2021 04:55 AM
Hello guys,
I'm trying to read a JSON file that contains backslashes, and PySpark fails to read it.
I've tried a lot of options but haven't solved it yet. I thought about reading the whole JSON as text and replacing every "\" with "/", but PySpark fails to read it as text too.
Example JSON:
{
  "fname": "max",
  "lname": "tom",
  "path": "c\\dir1\\dir2"
}
Code I tried:
df = spark.read.option('mode','PERMISSIVE').option('columnNameOfCorruptRecord', '_corrupt_record').json('path_to_json', multiLine=True)
df = spark.read.text('path_to_json')
With the first code example, when I don't specify a schema I get an "Unable to infer schema" error, and if I do specify one I get "Query returned no result".
With the second code example I also get "Query returned no result".
The file contains the JSON data, but because of the backslashes in the path field PySpark fails to read it as valid JSON.
(If there is a way to drop the path field while reading the JSON I don't mind doing that, but I didn't find any information on how to achieve it.)
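For reference, the text-and-replace approach I had in mind looks roughly like this (just a sketch, not working code; wholetext=True assumes each file holds a single JSON document, and 'path_to_json' is the same placeholder as above):
from pyspark.sql import functions as F
# Read each file as one whole-text row, swap the backslashes for
# forward slashes, then parse the cleaned strings as JSON.
raw = spark.read.text('path_to_json', wholetext=True)
cleaned = raw.select(F.regexp_replace('value', r'\\', '/').alias('value'))
df = spark.read.json(cleaned.rdd.map(lambda r: r.value))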
Hope someone can help me out.
Thanks!
Accepted Solutions
11-15-2021 12:58 AM
I did it with boto3 instead of PySpark, since it's not a lot of files.
import json
import re
import boto3
import pandas as pd
from pandas import json_normalize

# JARVIS_BUCKET, prefix and ANCHOR_PATTERN are my own constants, defined elsewhere.
jsons_data = []
client = boto3.client('s3')
s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket(JARVIS_BUCKET)
for obj in bucket.objects.filter(Prefix=prefix):
    file_name = obj.key
    if re.search(ANCHOR_PATTERN, file_name):
        json_obj = client.get_object(Bucket=JARVIS_BUCKET, Key=file_name)
        body = json_obj['Body']
        json_string = body.read().decode('utf-8')
        # strict=False lets json.loads accept control characters
        # inside the string values.
        jsons_data.append(json_normalize(json.loads(json_string, strict=False)))
df = pd.concat(jsons_data)
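If the result needs to go back into Spark afterwards, the pandas frame converts directly (assuming an active SparkSession named spark):
spark_df = spark.createDataFrame(df)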
10-21-2021 09:58 AM
Hi @orian hindi,
Please let us know if @Kaniz Fatma's solution worked for you, and if so, select it as the best answer. If not, please provide more details and we will help you solve your error.
11-11-2021 04:25 AM
Thanks for the response. I managed to solve it by myself 😀
11-11-2021 08:48 AM
@orian hindi - Would you be happy to post the solution you came up with and then mark it as best? That will help other members. 😎