07-06-2022 12:00 PM
I have a table in databricks called owner_final_delta with a column called contacts that holds data with this structure:
array<struct<address:struct<apartment:string,city:string,house:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>,addresses:array<struct<apartment:string,city:string,house:string,lastSeen:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>>,contactKey:string,emails:array<struct<emailId:string,lastSeen:string,sources:array<string>>>,lastModifiedDate:string,name:struct<firstNames:array<string>,lastNames:array<string>,middleNames:array<string>,salutations:array<string>,suffixes:array<string>>,phones:array<struct<extension:string,lastSeen:string,lineType:string,number:string,sources:array<string>,validSince:string>>,relationship:string,sources:array<string>>>
From this, I want to extract the emailId. I can extract contacts.emails which is an array that contains the emailId which itself can also be an array (if there are multiple emails tied to one record). The following is an example of one record returned from contacts.emails. A contact is akin to a business/company. So each element in the contacts.emails array is a person under that business/company. Furthermore, each person can have multiple emails (emailId).
array
What I would like to achieve is a column of just the emailId where there is one emailId per row. For the above example, I would like this one record split into 9 rows, one for each emailId. I tried using get_json_object but must be doing something wrong.
select get_json_object(cast(contacts.emails as string), '$.emailId') as emailId from owner_final_delta
I tried the above query as well as other variations like using STR() or trying contacts.emails[0] or $contacts.emails and they all run into a compiler error or return null values. I would prefer a solution using SQL (so it can be easily utilized in Tableau), but any solution would work.
07-13-2022 10:22 AM
So when I read it in, this is how I read it in. I'm changing the temporary view table name so it wont conflict with your other table.
Python:
df1 = spark.read.format("csv").option("header", "true").option("escape", "\"").option("multiLine",True).load("file_path_where_csv_file_is_located")
df1.createOrReplaceTempView("owner_final_delta1")
display(df1)
select
from_json(contacts, """array<struct<address:struct<apartment:string,city:string,house:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>,addresses:array<struct<apartment:string,city:string,house:string,lastSeen:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>>,contactKey:string,emails:array<struct<emailId:string,lastSeen:string,sources:array<string>>>,lastModifiedDate:string,name:struct<firstNames:array<string>,lastNames:array<string>,middleNames:array<string>,salutations:array<string>,suffixes:array<string>>,phones:array<struct<extension:string,lastSeen:string,lineType:string,number:string,sources:array<string>,validSince:string>>,relationship:string,sources:array<string>>>""") contacts_parsed,
*
from owner_final_delta1
Then the last one is SQL.
07-13-2022 12:33 PM
This appears to have worked! I first uploaded it as a csv but turns out data was lost so that didn't work. But then I saved it as a .xlsx file and uploaded that to databricks. I then made a table out of it and saw that all the attributes (contacts included) were of string type.
When running the above SQL command you posted, the output was returned with the contacts_parsed column and the little arrow allowing me to see the breakdown of the JSON object.
Then when running this as part of the larger query earlier with explode, I saw the emailIds column perfectly. My only remaining question is if this worked on contacts that had multiple emailIds. I don't know if there were any examples of this in the sample I gave you, but based on the solution you provided should it work on those contacts with multiple emailIds? "Working" would entail each row having one emailId, even if one contact has multiple emailIds, break them up so one is in each row of the emailIds column.
07-13-2022 01:14 PM
It should with the explode function but the sample data I have did not have a case where there were more than one emailId in that array so you will have to do a group by and see if you see a count greater than 1 in your own larger data set. I did see there were plenty of empty emailIds and empty contact cells in our sample, so it does filter through those.
07-13-2022 01:24 PM
That would be group by the _id and check the count greater than 1 right?
07-13-2022 01:33 PM
Yup. But you have to make sure you have duplicates in the data to begin with. If you can't find one right away to do your test on, I would do a test where I add a row of synthetic data that has multiple emailIds within that array. Then when you do a SELECT on the parsed table version, you can do a WHERE on that ID number of your synthetic data to see if there are more than one row.
07-13-2022 01:50 PM
Ok, thanks again.😁 Will let you know if I run into any other issues on this topic. I need to also do this for the phone numbers so I will see if I can get it working for that as well.
07-14-2022 01:43 PM
Yes explode does handle the duplicates/contacts with multiple emailIds
07-14-2022 01:50 PM
🙌
07-14-2022 03:02 PM
Is it possible for me to have your email address to ask you questions directly?
07-14-2022 03:55 PM
It turns out I need to do this also for the sources of each emailId and simply changing emailId to sources in the solution doesn't work. I think it has something to do with emailId being a string and sources being an array of strings. If you know how to easily modify your solution to access the sources in a similar fashion then that would be very helpful.
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group