Extracting data from a multi-layered JSON object

aschiff
Contributor II

I have a table in Databricks called owner_final_delta with a column called contacts that holds data with this structure:

array<struct<address:struct<apartment:string,city:string,house:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>,addresses:array<struct<apartment:string,city:string,house:string,lastSeen:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>>,contactKey:string,emails:array<struct<emailId:string,lastSeen:string,sources:array<string>>>,lastModifiedDate:string,name:struct<firstNames:array<string>,lastNames:array<string>,middleNames:array<string>,salutations:array<string>,suffixes:array<string>>,phones:array<struct<extension:string,lastSeen:string,lineType:string,number:string,sources:array<string>,validSince:string>>,relationship:string,sources:array<string>>>

From this, I want to extract the emailId. I can extract contacts.emails, which is an array that contains the emailId values and which can itself hold an array per contact (if there are multiple emails tied to one record). The following is an example of one record returned from contacts.emails. A contact is akin to a business/company, so each element in the contacts.emails array is a person under that business/company. Furthermore, each person can have multiple emails (emailId values).

array

  • 0: [{"emailId": "emailA1@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}, {"emailId": "emailA2@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}, {"emailId": "emailA3@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}]
  • 1: [{"emailId": "emailB1@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["SNL", "REONOMY"]}, {"emailId": "emailB2@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}, {"emailId": "emailB3@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}]
  • 2: [{"emailId": "emailC1@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}, {"emailId": "emailC2@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}, {"emailId": "emailC3@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}]

What I would like to achieve is a column of just the emailId, with one emailId per row. For the above example, I would like this one record split into 9 rows, one for each emailId. I tried using get_json_object but I must be doing something wrong.

select get_json_object(cast(contacts.emails as string), '$.emailId') as emailId from owner_final_delta

I tried the above query as well as other variations, like using STR(), contacts.emails[0], or $contacts.emails, and they all either run into a compiler error or return null values. I would prefer a solution using SQL (so it can be easily utilized in Tableau), but any solution would work.
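
To make the target shape concrete, the kind of query I am after would look roughly like the sketch below (just a sketch of the idea, assuming contacts really keeps the array<struct<...>> type shown above; my understanding is that get_json_object expects a JSON-formatted string, which may be why casting the column to a string only returns nulls):

select email.emailId
from owner_final_delta
-- one row per contact, then one row per email within that contact
lateral view explode(contacts) c as contact
lateral view explode(contact.emails) e as email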


Dooley
Valued Contributor

So when I read it in, this is how I read it in. I'm changing the temporary view name so it won't conflict with your other table.

Python:

df1 = spark.read.format("csv").option("header", "true").option("escape", "\"").option("multiLine", True).load("file_path_where_csv_file_is_located")
df1.createOrReplaceTempView("owner_final_delta1")
display(df1)

SQL:

select
  from_json(contacts, """array<struct<address:struct<apartment:string,city:string,house:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>,addresses:array<struct<apartment:string,city:string,house:string,lastSeen:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>>,contactKey:string,emails:array<struct<emailId:string,lastSeen:string,sources:array<string>>>,lastModifiedDate:string,name:struct<firstNames:array<string>,lastNames:array<string>,middleNames:array<string>,salutations:array<string>,suffixes:array<string>>,phones:array<struct<extension:string,lastSeen:string,lineType:string,number:string,sources:array<string>,validSince:string>>,relationship:string,sources:array<string>>>""") as contacts_parsed,
  *
from owner_final_delta1

Then the last one is SQL.
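
If it helps for the later steps, that parsed result can also be registered as its own temporary view so the follow-up queries have something named to read from. A minimal sketch (the view name owner_final_delta_parsed is just an example; the schema string is the same one used above):

-- keep the parsed column alongside the original columns
create or replace temporary view owner_final_delta_parsed as
select
  from_json(contacts, 'array<struct<address:struct<apartment:string,city:string,house:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>,addresses:array<struct<apartment:string,city:string,house:string,lastSeen:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>>,contactKey:string,emails:array<struct<emailId:string,lastSeen:string,sources:array<string>>>,lastModifiedDate:string,name:struct<firstNames:array<string>,lastNames:array<string>,middleNames:array<string>,salutations:array<string>,suffixes:array<string>>,phones:array<struct<extension:string,lastSeen:string,lineType:string,number:string,sources:array<string>,validSince:string>>,relationship:string,sources:array<string>>>') as contacts_parsed,
  *
from owner_final_delta1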

This appears to have worked! I first uploaded it as a CSV, but it turns out data was lost, so that didn't work. Then I saved it as a .xlsx file and uploaded that to Databricks. I then made a table out of it and saw that all the attributes (contacts included) were of string type.

When running the SQL command you posted, the output came back with the contacts_parsed column and the little arrow that lets me see the breakdown of the JSON object.

Then, when running this as part of the larger query from earlier with explode, I saw the emailIds column perfectly. My only remaining question is whether this works on contacts that have multiple emailIds. I don't know if there were any examples of this in the sample I gave you, but based on the solution you provided, should it work on those contacts? "Working" would mean that each row gets exactly one emailId: even if one contact has multiple emailIds, they are broken up so that each one lands in its own row of the emailIds column.
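
For reference, the shape of that larger query with explode is roughly the following (a sketch only, written against the owner_final_delta_parsed view sketched above; the exact query I ran may have differed slightly):

select email.emailId as emailIds
from owner_final_delta_parsed
-- first explode the contacts array, then each contact's emails array
lateral view explode(contacts_parsed) c as contact
lateral view explode(contact.emails) e as email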

Dooley
Valued Contributor

It should, with the explode function, but the sample data I have did not include a case where there was more than one emailId in that array, so you will have to do a group by and see if you get a count greater than 1 in your own larger data set. I did see there were plenty of empty emailIds and empty contact cells in our sample, so it does filter through those.

That would be grouping by the _id and checking for a count greater than 1, right?

Dooley
Valued Contributor

Yup. But you have to make sure you have duplicates in the data to begin with. If you can't find one right away to do your test on, I would add a row of synthetic data that has multiple emailIds within that array. Then, when you do a SELECT on the parsed version of the table, you can do a WHERE on the ID number of your synthetic data to see whether it comes back as more than one row.
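
A sketch of that check, assuming the exploded rows are built from the owner_final_delta_parsed view sketched earlier and that your table has the _id column you mentioned:

-- any _id returning more than one row had multiple emailIds (or multiple contacts)
select _id, count(*) as email_count
from owner_final_delta_parsed
lateral view explode(contacts_parsed) c as contact
lateral view explode(contact.emails) e as email
group by _id
having count(*) > 1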

Ok, thanks again.😁 Will let you know if I run into any other issues on this topic. I also need to do this for the phone numbers, so I will see if I can get it working for that as well.

Yes, explode does handle the duplicates/contacts with multiple emailIds.

Dooley
Valued Contributor

🙌

Is it possible for me to have your email address to ask you questions directly?

It turns out I also need to do this for the sources of each emailId, and simply changing emailId to sources in the solution doesn't work. I think it has something to do with emailId being a string while sources is an array of strings. If you know how to easily modify your solution to access the sources in a similar fashion, that would be very helpful.
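
One direction that might work (a sketch only, not verified against this data, again using the owner_final_delta_parsed view from the earlier sketch): since sources is an array<string> rather than a string, it needs one more explode of its own instead of a plain field reference.

select email.emailId as emailId, email_source
from owner_final_delta_parsed
lateral view explode(contacts_parsed) c as contact
lateral view explode(contact.emails) e as email
-- one additional explode so each source string gets its own row
lateral view explode(email.sources) s as email_source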
