topic Re: Extracting data from a multi-layered JSON object in Data Engineering

Extracting data from a multi-layered JSON object

aschiff — Wed, 06 Jul 2022 19:00:19 GMT

I have a table in databricks called owner_final_delta with a column called contacts that holds data with this structure:

array<struct<address:struct<apartment:string,city:string,house:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>,addresses:array<struct<apartment:string,city:string,house:string,lastSeen:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>>,contactKey:string,emails:array<struct<emailId:string,lastSeen:string,sources:array<string>>>,lastModifiedDate:string,name:struct<firstNames:array<string>,lastNames:array<string>,middleNames:array<string>,salutations:array<string>,suffixes:array<string>>,phones:array<struct<extension:string,lastSeen:string,lineType:string,number:string,sources:array<string>,validSince:string>>,relationship:string,sources:array<string>>>

From this, I want to extract the emailId. I can extract contacts.emails which is an array that contains the emailId which itself can also be an array (if there are multiple emails tied to one record). The following is an example of one record returned from contacts.emails. A contact is akin to a business/company. So each element in the contacts.emails array is a person under that business/company. Furthermore, each person can have multiple emails (emailId).

array

0: [{"emailId": "emailA1@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}, {"emailId": "emailA2@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}, {"emailId": "emailA3@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}]
1: [{"emailId": "emailB1@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["SNL", "REONOMY"]}, {"emailId": "emailB2@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}, {"emailId": "emailB3@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}]
2: [{"emailId": "emailC1@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}, {"emailId": "emailC2@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}, {"emailId": "emailC3@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}]

What I would like to achieve is a column of just the emailId where there is one emailId per row. For the above example, I would like this one record split into 9 rows, one for each emailId. I tried using get_json_object but must be doing something wrong.

select get_json_object(cast(contacts.emails as string), '$.emailId') as emailId from owner_final_delta

I tried the above query as well as other variations like using STR() or trying contacts.emails[0] or $contacts.emails and they all run into a compiler error or return null values. I would prefer a solution using SQL (so it can be easily utilized in Tableau), but any solution would work.

Re: Extracting data from a multi-layered JSON object

Dooley — Wed, 06 Jul 2022 20:27:43 GMT

Have you tried to use the explode function for that column with the array?

df.select(explode(df.emailId).alias("email")).show()

----------

Also, if you are a SQL lover, you can instead use the Databricks syntax for querying a JSON seen here.

Re: Extracting data from a multi-layered JSON object

aschiff — Wed, 06 Jul 2022 21:22:50 GMT

Thanks for your response! I tried the explode function (which is a pretty dangerous name for a function) and made sure to import pyspark.sql.functions and even made a new cluster with the most updated databricks runtime version, but I still get a "explode not defined" error. I even tried importing directly pyspark.sql.functions.explode but that model couldn't be found.

I then looked into the "Querying semi-structured data in SQL" documentation. I tried some of the methods there like raw:contacts.emails[*].emailId but got errors revolving around "raw column not existing" and when removing the "raw:" part I got an "invalid usage of '*'" error.

Re: Extracting data from a multi-layered JSON object

Dooley — Fri, 08 Jul 2022 20:57:21 GMT

Explode - Does this code below give you the same error?

from pyspark.sql import functions as F
from pyspark.sql import Row
eDF = spark.createDataFrame([Row(a=1, intlist=[1,2,3], mapfield={"a": "b"})])
eDF.select(F.explode(eDF.intlist).alias("anInt")).show()

For the SQL method, what is the column name in the table that holds this JSON structure in each row? Let's say that it is "contacts" and yet your JSON starts the nesting with "contacts" then it would be:

SELECT contacts:contacts.emails[*].emailId FROM table_name

Same error with this?

Re: Extracting data from a multi-layered JSON object

aschiff — Fri, 08 Jul 2022 22:10:06 GMT

I tried your exact explode example and it worked. I then plugged in my data like so:

df = sqlContext.table("owner_final_delta")

import pyspark.sql.functions as F

df.select(F.explode(df.contacts.emails[0].emailId).alias("email")).show()

This worked but notice I am using [0] (index 0 of the emails array). I tried using [*] there but I get an invalid syntax error. Is there a way to loop through all df.contacts.emails and return all the .emailIds in one column?

For the SQL method the column name holding the JSON structure is contacts. So I tried the query exactly as you have written it:

select contacts:contacts.emails[*].emailId from owner_final_delta

This returns this error essentially saying there is an argument type mismatch in that contacts is not a string:

Error in SQL statement: AnalysisException: cannot resolve 'semi_structured_extract_json_multi(spark_catalog.default.owner_final_delta.contacts, '$.contacts.emails[*].emailId')' due to data type mismatch: argument 1 requires string type, however, 'spark_catalog.default.owner_final_delta.contacts' is of array<struct<address:struct<apartment:string,city:string,house:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>,addresses:array<struct<apartment:string,city:string,house:string,lastSeen:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>>,contactKey:string,emails:array<struct<emailId:string,lastSeen:string,sources:array<string>>>,lastModifiedDate:string,name:struct<firstNames:array<string>,lastNames:array<string>,middleNames:array<string>,salutations:array<string>,suffixes:array<string>>,phones:array<struct<extension:string,lastSeen:string,lineType:string,number:string,sources:array<string>,validSince:string>>,relationship:string,sources:array<string>>> type.; line 1 pos 7;

'Project [semi_structured_extract_json_multi(spark_catalog.default.owner_final_delta.contacts, '$.contacts.emails[*].emailId') AS emailId#956]

How should this be resolved?

Re: Extracting data from a multi-layered JSON object

Dooley — Fri, 08 Jul 2022 22:53:01 GMT

df = sqlContext.table("owner_final_delta")
import pyspark.sql.functions as F
df.select(F.explode(df.contacts.emails).alias("email")).show()

If you explode only the emails, I'm thinking each row will be a new emailId in that category. What do you get with that?

-------

SQL side of life

If the column is "contacts" and emails is the first level in the nested json for emailId, I'm thinking you might have to leave off the second "contacts" I put in.

SELECT contacts:emails[*].emailId FROM owner_final_delta

Does this also give you an error?

Re: Extracting data from a multi-layered JSON object

aschiff — Sat, 09 Jul 2022 20:49:18 GMT

With your modified explode function, I am now getting the whole array of arrays of contacts.emails. It looks like [[emailId, date, source]]. There could be multiple in one row as well like [[emailId, date, source], [emailId, date, source], [emailId, date, source]]. So we still need to drill further.

As for the SQL command, I am still getting a type mismatch error seen below.

But when trying select contacts::string:emails[*].emailId from owner_final_delta I get an output but all the values are null. So again, I believe we are on the right track and just need to make some finer adjustments somewhere. And thanks for the continued help!

Re: Extracting data from a multi-layered JSON object

Dooley — Mon, 11 Jul 2022 18:20:21 GMT

I think you are right. We are close.

For Python, let's try exploding twice. If we have, [[emailId, date, source], [emailId, date, source], [emailId, date, source]] then let us explode that column out as well so each email ID has its own row.

df = sqlContext.table("owner_final_delta")
import pyspark.sql.functions as F
df.select(F.explode(F.explode(df.contacts.emails).alias("email")).alias("email_ids")).show()

And then for SQL, does this command give you errors as well? I

SELECT from_json(contacts:emails[*], 'array<array<string>>') emails FROM owner_final_delta

I believe emails is an array of an array of strings. I want to see if we can get here first without any errors before digging deeper into the nest.

Re: Extracting data from a multi-layered JSON object

aschiff — Mon, 11 Jul 2022 20:02:55 GMT

So I tried exploding twice and got the error: Generators are not supported when it's nested in expressions, but got: explode(explode(contacts.emails) AS email).

After doing some research, explode is a generator function that makes new columns so you can't use functions on explode. https://stackoverflow.com/questions/50125971/generators-are-not-supported-when-its-nested-in-expressions

I was also looking into the explode function in Pandas but don't know if this one would work or how to get the data formatted correctly to be a Pandas DataFrame.

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html

The SQL query gives me the same error of mismatched argument type.

Re: Extracting data from a multi-layered JSON object

Dooley — Mon, 11 Jul 2022 21:15:50 GMT

Is there any way you can post a small sample of the owner_final_delta table so I can try out some code on my side?

Re: Extracting data from a multi-layered JSON object

aschiff — Mon, 11 Jul 2022 23:01:32 GMT

Here you go. Good luck!

Re: Extracting data from a multi-layered JSON object

Dooley — Wed, 13 Jul 2022 16:15:35 GMT

This worked for me (see the column "emailIds" in the output):

%sql
select 
  explode(
    aggregate(
      contacts_parsed.emails,
      array(''),
      (acc,x)->array_union(acc,transform(x,(y,i)->y.emailId)),
      acc->array_remove(acc,'')
    )
   ) emailIds, 
   * from (
      select 
        from_json(contacts, """array<struct<address:struct<apartment:string,city:string,house:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>,addresses:array<struct<apartment:string,city:string,house:string,lastSeen:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>>,contactKey:string,emails:array<struct<emailId:string,lastSeen:string,sources:array<string>>>,lastModifiedDate:string,name:struct<firstNames:array<string>,lastNames:array<string>,middleNames:array<string>,salutations:array<string>,suffixes:array<string>>,phones:array<struct<extension:string,lastSeen:string,lineType:string,number:string,sources:array<string>,validSince:string>>,relationship:string,sources:array<string>>>""") contacts_parsed, 
        * 
      from owner_final_delta
)

Re: Extracting data from a multi-layered JSON object

aschiff — Wed, 13 Jul 2022 16:32:15 GMT

I tried your exact code:And still get the type mismatch error:

Re: Extracting data from a multi-layered JSON object

Dooley — Wed, 13 Jul 2022 17:01:36 GMT

This looks like it is not parsing which is interesting. Are you using the latest databricks runtime in your cluster?

Also, does this part give you errors?

select 
        from_json(contacts, """array<struct<address:struct<apartment:string,city:string,house:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>,addresses:array<struct<apartment:string,city:string,house:string,lastSeen:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>>,contactKey:string,emails:array<struct<emailId:string,lastSeen:string,sources:array<string>>>,lastModifiedDate:string,name:struct<firstNames:array<string>,lastNames:array<string>,middleNames:array<string>,salutations:array<string>,suffixes:array<string>>,phones:array<struct<extension:string,lastSeen:string,lineType:string,number:string,sources:array<string>,validSince:string>>,relationship:string,sources:array<string>>>""") contacts_parsed, 
        * 
      from owner_final_delta

If it parses correctly the output in that "contacts_parsed" column should show you an arrow on the top left part of each cell in that column that points to the right. You click it and it breaks down the pieces in the json file.

Re: Extracting data from a multi-layered JSON object

aschiff — Wed, 13 Jul 2022 17:11:05 GMT

My cluster is on databricks runtime 11.0, spark 3.3.0 and Scala 2.12 and yes that part gives me the error.

Re: Extracting data from a multi-layered JSON object

Dooley — Wed, 13 Jul 2022 17:22:21 GMT

So when I read it in, this is how I read it in. I'm changing the temporary view table name so it wont conflict with your other table.

Python:

df1 = spark.read.format("csv").option("header", "true").option("escape", "\"").option("multiLine",True).load("file_path_where_csv_file_is_located")
df1.createOrReplaceTempView("owner_final_delta1")
display(df1)

      select 
        from_json(contacts, """array<struct<address:struct<apartment:string,city:string,house:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>,addresses:array<struct<apartment:string,city:string,house:string,lastSeen:string,poBox:string,sources:array<string>,state:string,street:string,type:string,zip:string>>,contactKey:string,emails:array<struct<emailId:string,lastSeen:string,sources:array<string>>>,lastModifiedDate:string,name:struct<firstNames:array<string>,lastNames:array<string>,middleNames:array<string>,salutations:array<string>,suffixes:array<string>>,phones:array<struct<extension:string,lastSeen:string,lineType:string,number:string,sources:array<string>,validSince:string>>,relationship:string,sources:array<string>>>""") contacts_parsed, 
        * 
      from owner_final_delta1

Then the last one is SQL.

Re: Extracting data from a multi-layered JSON object

aschiff — Wed, 13 Jul 2022 19:33:54 GMT

This appears to have worked! I first uploaded it as a csv but turns out data was lost so that didn't work. But then I saved it as a .xlsx file and uploaded that to databricks. I then made a table out of it and saw that all the attributes (contacts included) were of string type.

When running the above SQL command you posted, the output was returned with the contacts_parsed column and the little arrow allowing me to see the breakdown of the JSON object.

Then when running this as part of the larger query earlier with explode, I saw the emailIds column perfectly. My only remaining question is if this worked on contacts that had multiple emailIds. I don't know if there were any examples of this in the sample I gave you, but based on the solution you provided should it work on those contacts with multiple emailIds? "Working" would entail each row having one emailId, even if one contact has multiple emailIds, break them up so one is in each row of the emailIds column.

Re: Extracting data from a multi-layered JSON object

Dooley — Wed, 13 Jul 2022 20:14:15 GMT

It should with the explode function but the sample data I have did not have a case where there were more than one emailId in that array so you will have to do a group by and see if you see a count greater than 1 in your own larger data set. I did see there were plenty of empty emailIds and empty contact cells in our sample, so it does filter through those.

Re: Extracting data from a multi-layered JSON object

aschiff — Wed, 13 Jul 2022 20:24:37 GMT

That would be group by the _id and check the count greater than 1 right?

Re: Extracting data from a multi-layered JSON object

Dooley — Wed, 13 Jul 2022 20:33:41 GMT

Yup. But you have to make sure you have duplicates in the data to begin with. If you can't find one right away to do your test on, I would do a test where I add a row of synthetic data that has multiple emailIds within that array. Then when you do a SELECT on the parsed table version, you can do a WHERE on that ID number of your synthetic data to see if there are more than one row.