Re: Extracting data from a multi-layered JSON obje...

Dooley · ‎07-13-2022

So when I read it in, this is how I read it in. I'm changing the temporary view table name so it wont conflict with your other table.

Python:

df1 = spark.read.format("csv").option("header", "true").option("escape", "\"").option("multiLine",True).load("file_path_where_csv_file_is_located")
df1.createOrReplaceTempView("owner_final_delta1")
display(df1)

      select 
        from_json(contacts, """array<struct<address:struct<apartment:string,city:string,house:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>,addresses:array<struct<apartment:string,city:string,house:string,lastSeen:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>>,contactKey:string,emails:array<struct<emailId:string,lastSeen:string,sources:array<string>>>,lastModifiedDate:string,name:struct<firstNames:array<string>,lastNames:array<string>,middleNames:array<string>,salutations:array<string>,suffixes:array<string>>,phones:array<struct<extension:string,lastSeen:string,lineType:string,number:string,sources:array<string>,validSince:string>>,relationship:string,sources:array<string>>>""") contacts_parsed, 
        * 
      from owner_final_delta1

Then the last one is SQL.

View solution in original post

aschiff · ‎07-13-2022

This appears to have worked! I first uploaded it as a csv but turns out data was lost so that didn't work. But then I saved it as a .xlsx file and uploaded that to databricks. I then made a table out of it and saw that all the attributes (contacts included) were of string type.

When running the above SQL command you posted, the output was returned with the contacts_parsed column and the little arrow allowing me to see the breakdown of the JSON object.

Then when running this as part of the larger query earlier with explode, I saw the emailIds column perfectly. My only remaining question is if this worked on contacts that had multiple emailIds. I don't know if there were any examples of this in the sample I gave you, but based on the solution you provided should it work on those contacts with multiple emailIds? "Working" would entail each row having one emailId, even if one contact has multiple emailIds, break them up so one is in each row of the emailIds column.

Dooley · ‎07-13-2022

It should with the explode function but the sample data I have did not have a case where there were more than one emailId in that array so you will have to do a group by and see if you see a count greater than 1 in your own larger data set. I did see there were plenty of empty emailIds and empty contact cells in our sample, so it does filter through those.

aschiff · ‎07-13-2022

That would be group by the _id and check the count greater than 1 right?

Dooley · ‎07-13-2022

Yup. But you have to make sure you have duplicates in the data to begin with. If you can't find one right away to do your test on, I would do a test where I add a row of synthetic data that has multiple emailIds within that array. Then when you do a SELECT on the parsed table version, you can do a WHERE on that ID number of your synthetic data to see if there are more than one row.

aschiff · ‎07-13-2022

Ok, thanks again.😁 Will let you know if I run into any other issues on this topic. I need to also do this for the phone numbers so I will see if I can get it working for that as well.

aschiff · ‎07-14-2022

Yes explode does handle the duplicates/contacts with multiple emailIds

Dooley · ‎07-14-2022

🙌

aschiff · ‎07-14-2022

Is it possible for me to have your email address to ask you questions directly?

aschiff · ‎07-14-2022

It turns out I need to do this also for the sources of each emailId and simply changing emailId to sources in the solution doesn't work. I think it has something to do with emailId being a string and sources being an array of strings. If you know how to easily modify your solution to access the sources in a similar fashion then that would be very helpful.