10-17-2021 04:55 AM
Hello guys.
I'm trying to read a JSON file which contains backslashes, and I fail to read it via PySpark.
I've tried a lot of options but haven't solved it yet. I thought about reading all the JSON as text and replacing every "\" with "/", but PySpark fails to read it as text too.
Example JSON:
{
  "fname": "max",
  "lname": "tom",
  "path": "c\\dir1\\dir2"
}
Code that I tried:
# Attempt 1: read as JSON in permissive mode, capturing corrupt records
df = spark.read.option('mode', 'PERMISSIVE').option('columnNameOfCorruptRecord', '_corrupt_record').json('path_to_json', multiLine=True)
# Attempt 2: read the raw file as plain text
df = spark.read.text('path_to_json')
With the first example, when I don't specify a schema I get an "Unable to infer schema" error, and when I do specify one I get "Query returned no result".
With the second example I also get "Query returned no result".
The file does contain the JSON data, but because of the path field PySpark fails to read it as valid JSON.
(If there is a way to drop the path field while reading the JSON I don't mind doing that, but I didn't find any information on how to achieve it.)
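For illustration only, here is a rough sketch of that read-as-text idea. It assumes a hypothetical explicit schema that simply omits the path field, and files small enough to load whole with wholetext=True; it is a sketch, not a confirmed fix:

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical schema that leaves out the problematic 'path' field
schema = StructType([
    StructField('fname', StringType()),
    StructField('lname', StringType()),
])

# Read each file as one string row, double every backslash so the text becomes
# parseable JSON, then parse the repaired string with from_json
raw = spark.read.text('path_to_json', wholetext=True)
repaired = raw.withColumn('value', F.regexp_replace('value', r'\\', r'\\\\'))
df = repaired.select(F.from_json('value', schema).alias('data')).select('data.*')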
Hope some one can help me out.
Thanks!
10-21-2021 09:58 AM
Hi @orian hindi,
Please let us know if @Kaniz Fatma's solution worked for you and, if so, select it as the best answer. If not, please provide more details and we will help you resolve your error.
11-11-2021 04:25 AM
Thanks for the response, I managed to solve it by myself 😀
11-11-2021 08:48 AM
@orian hindi - Would you be happy to post the solution you came up with and then mark it as best? That will help other members. 😎
11-15-2021 12:58 AM
I did it with boto3 instead of PySpark, since it's not a lot of files.
import json
import re

import boto3
import pandas as pd
from pandas import json_normalize

jsons_data = []
client = boto3.client('s3')
s3_resource = boto3.resource('s3')
bucket = s3_resource.Bucket(JARVIS_BUCKET)
for obj in bucket.objects.filter(Prefix=prefix):
    file_name = obj.key
    if re.search(ANCHOR_PATTERN, file_name):
        # Download the object and decode its body as UTF-8 text
        json_obj = client.get_object(Bucket=JARVIS_BUCKET, Key=file_name)
        body = json_obj['Body']
        json_string = body.read().decode('utf-8')
        # strict=False lets json.loads accept control characters inside strings
        jsons_data.append(json_normalize(json.loads(json_string, strict=False)))
df = pd.concat(jsons_data)
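A note on the parse step: strict=False only relaxes json.loads so that control characters are allowed inside strings. If some files also contain lone backslashes that form invalid escapes, they may still need to be doubled first (for example json_string.replace('\\', '\\\\')) before parsing; that replace call is an illustrative suggestion rather than part of the posted solution.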