Extracting data from a multi-layered JSON object

aschiff
Contributor II

I have a table in databricks called owner_final_delta with a column called contacts that holds data with this structure:

array<struct<address:struct<apartment:string,city:string,house:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>,addresses:array<struct<apartment:string,city:string,house:string,lastSeen:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>>,contactKey:string,emails:array<struct<emailId:string,lastSeen:string,sources:array<string>>>,lastModifiedDate:string,name:struct<firstNames:array<string>,lastNames:array<string>,middleNames:array<string>,salutations:array<string>,suffixes:array<string>>,phones:array<struct<extension:string,lastSeen:string,lineType:string,number:string,sources:array<string>,validSince:string>>,relationship:string,sources:array<string>>>

From this, I want to extract the emailId. I can extract contacts.emails which is an array that contains the emailId which itself can also be an array (if there are multiple emails tied to one record). The following is an example of one record returned from contacts.emails. A contact is akin to a business/company. So each element in the contacts.emails array is a person under that business/company. Furthermore, each person can have multiple emails (emailId).

array

  • 0: [{"emailId": "emailA1@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}, {"emailId": "emailA2@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}, {"emailId": "emailA3@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}]
  • 1: [{"emailId": "emailB1@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["SNL", "REONOMY"]}, {"emailId": "emailB2@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}, {"emailId": "emailB3@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}]
  • 2: [{"emailId": "emailC1@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}, {"emailId": "emailC2@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}, {"emailId": "emailC3@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}]

What I would like to achieve is a column of just the emailId where there is one emailId per row. For the above example, I would like this one record split into 9 rows, one for each emailId. I tried using get_json_object but must be doing something wrong.

select get_json_object(cast(contacts.emails as string), '$.emailId') as emailId from owner_final_delta

I tried the above query as well as other variations like using STR() or trying contacts.emails[0] or $contacts.emails and they all run into a compiler error or return null values. I would prefer a solution using SQL (so it can be easily utilized in Tableau), but any solution would work.