<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Extracting data from a multi-layered JSON object in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14535#M9007</link>
    <description>&lt;P&gt;I tried your exact code:&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1725i008A978181D00060/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;And still get the type mismatch error:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1729iACDD9239B3690656/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 13 Jul 2022 16:32:15 GMT</pubDate>
    <dc:creator>aschiff</dc:creator>
    <dc:date>2022-07-13T16:32:15Z</dc:date>
    <item>
      <title>Extracting data from a multi-layered JSON object</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14523#M8995</link>
      <description>&lt;P&gt;I have a table in databricks called owner_final_delta with a column called contacts that holds data with this structure:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;array&amp;lt;struct&amp;lt;address:struct&amp;lt;apartment:string,city:string,house:string,poBox:string,sources:array&amp;lt;string&amp;gt;,state:string,street:string,type:string,zip:string&amp;gt;,addresses:array&amp;lt;struct&amp;lt;apartment:string,city:string,house:string,lastSeen:string,poBox:string,sources:array&amp;lt;string&amp;gt;,state:string,street:string,type:string,zip:string&amp;gt;&amp;gt;,contactKey:string,emails:array&amp;lt;struct&amp;lt;emailId:string,lastSeen:string,sources:array&amp;lt;string&amp;gt;&amp;gt;&amp;gt;,lastModifiedDate:string,name:struct&amp;lt;firstNames:array&amp;lt;string&amp;gt;,lastNames:array&amp;lt;string&amp;gt;,middleNames:array&amp;lt;string&amp;gt;,salutations:array&amp;lt;string&amp;gt;,suffixes:array&amp;lt;string&amp;gt;&amp;gt;,phones:array&amp;lt;struct&amp;lt;extension:string,lastSeen:string,lineType:string,number:string,sources:array&amp;lt;string&amp;gt;,validSince:string&amp;gt;&amp;gt;,relationship:string,sources:array&amp;lt;string&amp;gt;&amp;gt;&amp;gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;From this, I want to extract the emailId. I can extract contacts.emails which is an array that contains the emailId which itself can also be an array (if there are multiple emails tied to one record). The following is an example of one record returned from contacts.emails. A contact is akin to a business/company. So each element in the contacts.emails array is a person under that business/company. Furthermore, each person can have multiple emails (emailId). &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;array&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;0:&amp;nbsp;[{"emailId": "emailA1@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}, {"emailId": "emailA2@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}, {"emailId": "emailA3@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}]&lt;/LI&gt;&lt;LI&gt;1:&amp;nbsp;[{"emailId": "emailB1@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["SNL", "REONOMY"]}, {"emailId": "emailB2@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}, {"emailId": "emailB3@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}]&lt;/LI&gt;&lt;LI&gt;2:&amp;nbsp;[{"emailId": "emailC1@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}, {"emailId": "emailC2@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}, {"emailId": "emailC3@gmail.com", "lastSeen": "Oct 12, 2021 6:51:33 PM", "sources": ["REONOMY"]}]&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What I would like to achieve is a column of just the emailId where there is one emailId per row. For the above example, I would like this one record split into 9 rows, one for each emailId. I tried using get_json_object but must be doing something wrong. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;select get_json_object(cast(contacts.emails as string), '$.emailId') as emailId from owner_final_delta&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I tried the above query as well as other variations like using STR() or trying contacts.emails[0] or $contacts.emails and they all run into a compiler error or return null values. I would prefer a solution using SQL (so it can be easily utilized in Tableau), but any solution would work. &lt;/P&gt;</description>
      <pubDate>Wed, 06 Jul 2022 19:00:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14523#M8995</guid>
      <dc:creator>aschiff</dc:creator>
      <dc:date>2022-07-06T19:00:19Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from a multi-layered JSON object</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14524#M8996</link>
      <description>&lt;P&gt;Have you tried to use the &lt;A href="https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.explode.html" alt="https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.explode.html" target="_blank"&gt;explode function &lt;/A&gt;for that column with the array?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df.select(explode(df.emailId).alias("email")).show()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;----------&lt;/P&gt;&lt;P&gt;Also, if you are a SQL lover, you can instead use the Databricks syntax for querying a JSON seen &lt;A href="https://docs.databricks.com/spark/latest/spark-sql/semi-structured.html" alt="https://docs.databricks.com/spark/latest/spark-sql/semi-structured.html" target="_blank"&gt;here&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Wed, 06 Jul 2022 20:27:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14524#M8996</guid>
      <dc:creator>Dooley</dc:creator>
      <dc:date>2022-07-06T20:27:43Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from a multi-layered JSON object</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14525#M8997</link>
      <description>&lt;P&gt;Thanks for your response! I tried the explode function (which is a pretty dangerous name for a function) and made sure to import pyspark.sql.functions and even made a new cluster with the most updated databricks runtime version, but I still get a "explode not defined" error. I even tried importing directly pyspark.sql.functions.explode but that model couldn't be found.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I then looked into the "Querying semi-structured data in SQL" documentation. I tried some of the methods there like raw:contacts.emails[*].emailId but got errors revolving around "raw column not existing" and when removing the "raw:" part I got an "invalid usage of '*'" error.&lt;/P&gt;</description>
      <pubDate>Wed, 06 Jul 2022 21:22:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14525#M8997</guid>
      <dc:creator>aschiff</dc:creator>
      <dc:date>2022-07-06T21:22:50Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from a multi-layered JSON object</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14526#M8998</link>
      <description>&lt;P&gt;Explode - Does this code below give you the same error?&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql import functions as F
from pyspark.sql import Row
eDF = spark.createDataFrame([Row(a=1, intlist=[1,2,3], mapfield={"a": "b"})])
eDF.select(F.explode(eDF.intlist).alias("anInt")).show()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;For the SQL method, what is the column name in the table that holds this JSON structure in each row? Let's say that it is "contacts" and yet your JSON starts the nesting with "contacts" then it would be:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;SELECT contacts:contacts.emails[*].emailId FROM table_name&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Same error with this?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 08 Jul 2022 20:57:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14526#M8998</guid>
      <dc:creator>Dooley</dc:creator>
      <dc:date>2022-07-08T20:57:21Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from a multi-layered JSON object</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14527#M8999</link>
      <description>&lt;P&gt;I tried your exact explode example and it worked. I then plugged in my data like so:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;df = sqlContext.table("owner_final_delta")&lt;/P&gt;&lt;P&gt;import pyspark.sql.functions as F&lt;/P&gt;&lt;P&gt;df.select(F.explode(df.contacts.emails[0].emailId).alias("email")).show()&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This worked but notice I am using [0] (index 0 of the emails array). I tried using [*] there but I get an invalid syntax error. Is there a way to loop through all df.contacts.emails and return all the .emailIds in one column?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For the SQL method the column name holding the JSON structure is contacts. So I tried the query exactly as you have written it:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;select contacts:contacts.emails[*].emailId from owner_final_delta&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This returns this error essentially saying there is an argument type mismatch in that contacts is not a string: &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Error in SQL statement: AnalysisException: cannot resolve 'semi_structured_extract_json_multi(spark_catalog.default.owner_final_delta.contacts, '$.contacts.emails[*].emailId')' due to data type mismatch: argument 1 requires string type, however, 'spark_catalog.default.owner_final_delta.contacts' is of array&amp;lt;struct&amp;lt;address:struct&amp;lt;apartment:string,city:string,house:string,poBox:string,sources:array&amp;lt;string&amp;gt;,state:string,street:string,type:string,zip:string&amp;gt;,addresses:array&amp;lt;struct&amp;lt;apartment:string,city:string,house:string,lastSeen:string,poBox:string,sources:array&amp;lt;string&amp;gt;,state:string,street:string,type:string,zip:string&amp;gt;&amp;gt;,contactKey:string,emails:array&amp;lt;struct&amp;lt;emailId:string,lastSeen:string,sources:array&amp;lt;string&amp;gt;&amp;gt;&amp;gt;,lastModifiedDate:string,name:struct&amp;lt;firstNames:array&amp;lt;string&amp;gt;,lastNames:array&amp;lt;string&amp;gt;,middleNames:array&amp;lt;string&amp;gt;,salutations:array&amp;lt;string&amp;gt;,suffixes:array&amp;lt;string&amp;gt;&amp;gt;,phones:array&amp;lt;struct&amp;lt;extension:string,lastSeen:string,lineType:string,number:string,sources:array&amp;lt;string&amp;gt;,validSince:string&amp;gt;&amp;gt;,relationship:string,sources:array&amp;lt;string&amp;gt;&amp;gt;&amp;gt; type.; line 1 pos 7;&lt;/P&gt;&lt;P&gt;'Project [semi_structured_extract_json_multi(spark_catalog.default.owner_final_delta.contacts, '$.contacts.emails[*].emailId') AS emailId#956]&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;How should this be resolved?&lt;/P&gt;</description>
      <pubDate>Fri, 08 Jul 2022 22:10:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14527#M8999</guid>
      <dc:creator>aschiff</dc:creator>
      <dc:date>2022-07-08T22:10:06Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from a multi-layered JSON object</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14528#M9000</link>
      <description>&lt;PRE&gt;&lt;CODE&gt;df = sqlContext.table("owner_final_delta")
import pyspark.sql.functions as F
df.select(F.explode(df.contacts.emails).alias("email")).show()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;If you explode only the emails, I'm thinking each row will be a new emailId in that category. What do you get with that?&lt;/P&gt;&lt;P&gt;-------&lt;/P&gt;&lt;P&gt;SQL side of life&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If the column is "contacts" and emails is the first level in the nested json for emailId, I'm thinking you might have to leave off the second "contacts" I put in. &lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;SELECT contacts:emails[*].emailId FROM owner_final_delta&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Does this also give you an error?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 08 Jul 2022 22:53:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14528#M9000</guid>
      <dc:creator>Dooley</dc:creator>
      <dc:date>2022-07-08T22:53:01Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from a multi-layered JSON object</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14529#M9001</link>
      <description>&lt;P&gt;With your modified explode function, I am now getting the whole array of arrays of contacts.emails. It looks like [[emailId, date, source]]. There could be multiple in one row as well like [[emailId, date, source], [emailId, date, source], [emailId, date, source]]. So we still need to drill further.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;As for the SQL command, I am still getting a type mismatch error seen below. &lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1735i36FA0C5543B50809/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;But when trying select contacts::string:emails[*].emailId from owner_final_delta I get an output but all the values are null. So again, I believe we are on the right track and just need to make some finer adjustments somewhere. And thanks for the continued help!&lt;/P&gt;</description>
      <pubDate>Sat, 09 Jul 2022 20:49:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14529#M9001</guid>
      <dc:creator>aschiff</dc:creator>
      <dc:date>2022-07-09T20:49:18Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from a multi-layered JSON object</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14530#M9002</link>
      <description>&lt;P&gt;I think you are right. We are close.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For Python, let's try exploding twice. If we have,&amp;nbsp;[[emailId, date, source], [emailId, date, source], [emailId, date, source]] then let us explode that column out as well so each email ID has its own row.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df = sqlContext.table("owner_final_delta")
import pyspark.sql.functions as F
df.select(F.explode(F.explode(df.contacts.emails).alias("email")).alias("email_ids")).show()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;And then for SQL, does this command give you errors as well? I&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;SELECT from_json(contacts:emails[*], 'array&amp;lt;array&amp;lt;string&amp;gt;&amp;gt;') emails FROM owner_final_delta&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I believe emails is an array of an array of strings. I want to see if we can get here first without any errors before digging deeper into the nest.&lt;/P&gt;</description>
      <pubDate>Mon, 11 Jul 2022 18:20:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14530#M9002</guid>
      <dc:creator>Dooley</dc:creator>
      <dc:date>2022-07-11T18:20:21Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from a multi-layered JSON object</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14531#M9003</link>
      <description>&lt;P&gt;So I tried exploding twice and got the error: Generators are not supported when it's nested in expressions, but got: explode(explode(contacts.emails) AS email).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;After doing some research, explode is a generator function that makes new columns so you can't use functions on explode. &lt;A href="https://stackoverflow.com/questions/50125971/generators-are-not-supported-when-its-nested-in-expressions" target="test_blank"&gt;https://stackoverflow.com/questions/50125971/generators-are-not-supported-when-its-nested-in-expressions&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I was also looking into the explode function in Pandas but don't know if this one would work or how to get the data formatted correctly to be a Pandas DataFrame. &lt;/P&gt;&lt;P&gt;&lt;A href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html" target="test_blank"&gt;https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The SQL query gives me the same error of mismatched argument type. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1723i4789C8FE920F18CE/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 11 Jul 2022 20:02:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14531#M9003</guid>
      <dc:creator>aschiff</dc:creator>
      <dc:date>2022-07-11T20:02:55Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from a multi-layered JSON object</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14532#M9004</link>
      <description>&lt;P&gt;Is there any way you can post a small sample of the owner_final_delta table so I can try out some code on my side? &lt;/P&gt;</description>
      <pubDate>Mon, 11 Jul 2022 21:15:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14532#M9004</guid>
      <dc:creator>Dooley</dc:creator>
      <dc:date>2022-07-11T21:15:50Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from a multi-layered JSON object</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14533#M9005</link>
      <description>&lt;P&gt;Here you go. Good luck!&lt;/P&gt;</description>
      <pubDate>Mon, 11 Jul 2022 23:01:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14533#M9005</guid>
      <dc:creator>aschiff</dc:creator>
      <dc:date>2022-07-11T23:01:32Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from a multi-layered JSON object</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14534#M9006</link>
      <description>&lt;P&gt;This worked for me (see the column "emailIds" in the output):&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;%sql
select 
  explode(
    aggregate(
      contacts_parsed.emails,
      array(''),
      (acc,x)-&amp;gt;array_union(acc,transform(x,(y,i)-&amp;gt;y.emailId)),
      acc-&amp;gt;array_remove(acc,'')
    )
   ) emailIds, 
   * from (
      select 
        from_json(contacts, """array&amp;lt;struct&amp;lt;address:struct&amp;lt;apartment:string,city:string,house:string,poBox:string,sources:array&amp;lt;string&amp;gt;,state:string,street:string,type:string,zip:string&amp;gt;,addresses:array&amp;lt;struct&amp;lt;apartment:string,city:string,house:string,lastSeen:string,poBox:string,sources:array&amp;lt;string&amp;gt;,state:string,street:string,type:string,zip:string&amp;gt;&amp;gt;,contactKey:string,emails:array&amp;lt;struct&amp;lt;emailId:string,lastSeen:string,sources:array&amp;lt;string&amp;gt;&amp;gt;&amp;gt;,lastModifiedDate:string,name:struct&amp;lt;firstNames:array&amp;lt;string&amp;gt;,lastNames:array&amp;lt;string&amp;gt;,middleNames:array&amp;lt;string&amp;gt;,salutations:array&amp;lt;string&amp;gt;,suffixes:array&amp;lt;string&amp;gt;&amp;gt;,phones:array&amp;lt;struct&amp;lt;extension:string,lastSeen:string,lineType:string,number:string,sources:array&amp;lt;string&amp;gt;,validSince:string&amp;gt;&amp;gt;,relationship:string,sources:array&amp;lt;string&amp;gt;&amp;gt;&amp;gt;""") contacts_parsed, 
        * 
      from owner_final_delta
)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jul 2022 16:15:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14534#M9006</guid>
      <dc:creator>Dooley</dc:creator>
      <dc:date>2022-07-13T16:15:35Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from a multi-layered JSON object</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14535#M9007</link>
      <description>&lt;P&gt;I tried your exact code:&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1725i008A978181D00060/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;And still get the type mismatch error:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1729iACDD9239B3690656/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jul 2022 16:32:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14535#M9007</guid>
      <dc:creator>aschiff</dc:creator>
      <dc:date>2022-07-13T16:32:15Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from a multi-layered JSON object</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14536#M9008</link>
      <description>&lt;P&gt;This looks like it is not parsing which is interesting. Are you using the latest databricks runtime in your cluster?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Also, does this part give you errors?&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;select 
        from_json(contacts, """array&amp;lt;struct&amp;lt;address:struct&amp;lt;apartment:string,city:string,house:string,poBox:string,sources:array&amp;lt;string&amp;gt;,state:string,street:string,type:string,zip:string&amp;gt;,addresses:array&amp;lt;struct&amp;lt;apartment:string,city:string,house:string,lastSeen:string,poBox:string,sources:array&amp;lt;string&amp;gt;,state:string,street:string,type:string,zip:string&amp;gt;&amp;gt;,contactKey:string,emails:array&amp;lt;struct&amp;lt;emailId:string,lastSeen:string,sources:array&amp;lt;string&amp;gt;&amp;gt;&amp;gt;,lastModifiedDate:string,name:struct&amp;lt;firstNames:array&amp;lt;string&amp;gt;,lastNames:array&amp;lt;string&amp;gt;,middleNames:array&amp;lt;string&amp;gt;,salutations:array&amp;lt;string&amp;gt;,suffixes:array&amp;lt;string&amp;gt;&amp;gt;,phones:array&amp;lt;struct&amp;lt;extension:string,lastSeen:string,lineType:string,number:string,sources:array&amp;lt;string&amp;gt;,validSince:string&amp;gt;&amp;gt;,relationship:string,sources:array&amp;lt;string&amp;gt;&amp;gt;&amp;gt;""") contacts_parsed, 
        * 
      from owner_final_delta&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;If it parses correctly the output in that "contacts_parsed" column should show you an arrow on the top left part of each cell in that column that points to the right. You click it and it breaks down the pieces in the json file.&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jul 2022 17:01:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14536#M9008</guid>
      <dc:creator>Dooley</dc:creator>
      <dc:date>2022-07-13T17:01:36Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from a multi-layered JSON object</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14537#M9009</link>
      <description>&lt;P&gt;My cluster is on databricks runtime 11.0, spark 3.3.0 and Scala 2.12 and yes that part gives me the error.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1736i5FE13CFB8B90FB44/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jul 2022 17:11:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14537#M9009</guid>
      <dc:creator>aschiff</dc:creator>
      <dc:date>2022-07-13T17:11:05Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from a multi-layered JSON object</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14538#M9010</link>
      <description>&lt;P&gt;So when I read it in, this is how I read it in. I'm changing the temporary view table name so it wont conflict with your other table.&lt;/P&gt;&lt;P&gt;Python:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df1 = spark.read.format("csv").option("header", "true").option("escape", "\"").option("multiLine",True).load("file_path_where_csv_file_is_located")
df1.createOrReplaceTempView("owner_final_delta1")
display(df1)&lt;/CODE&gt;&lt;/PRE&gt;&lt;PRE&gt;&lt;CODE&gt;      select 
        from_json(contacts, """array&amp;lt;struct&amp;lt;address:struct&amp;lt;apartment:string,city:string,house:string,poBox:string,sources:array&amp;lt;string&amp;gt;,state:string,street:string,type:string,zip:string&amp;gt;,addresses:array&amp;lt;struct&amp;lt;apartment:string,city:string,house:string,lastSeen:string,poBox:string,sources:array&amp;lt;string&amp;gt;,state:string,street:string,type:string,zip:string&amp;gt;&amp;gt;,contactKey:string,emails:array&amp;lt;struct&amp;lt;emailId:string,lastSeen:string,sources:array&amp;lt;string&amp;gt;&amp;gt;&amp;gt;,lastModifiedDate:string,name:struct&amp;lt;firstNames:array&amp;lt;string&amp;gt;,lastNames:array&amp;lt;string&amp;gt;,middleNames:array&amp;lt;string&amp;gt;,salutations:array&amp;lt;string&amp;gt;,suffixes:array&amp;lt;string&amp;gt;&amp;gt;,phones:array&amp;lt;struct&amp;lt;extension:string,lastSeen:string,lineType:string,number:string,sources:array&amp;lt;string&amp;gt;,validSince:string&amp;gt;&amp;gt;,relationship:string,sources:array&amp;lt;string&amp;gt;&amp;gt;&amp;gt;""") contacts_parsed, 
        * 
      from owner_final_delta1&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Then the last one is SQL.&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jul 2022 17:22:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14538#M9010</guid>
      <dc:creator>Dooley</dc:creator>
      <dc:date>2022-07-13T17:22:21Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from a multi-layered JSON object</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14539#M9011</link>
      <description>&lt;P&gt;This appears to have worked! I first uploaded it as a csv but turns out data was lost so that didn't work. But then I saved it as a .xlsx file and uploaded that to databricks. I then made a table out of it and saw that all the attributes (contacts included) were of string type. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;When running the above SQL command you posted, the output was returned with the contacts_parsed column and the little arrow allowing me to see the breakdown of the JSON object. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Then when running this as part of the larger query earlier with explode, I saw the emailIds column perfectly. My only remaining question is if this worked on contacts that had multiple emailIds. I don't know if there were any examples of this in the sample I gave you, but based on the solution you provided should it work on those contacts with multiple emailIds? "Working" would entail each row having one emailId, even if one contact has multiple emailIds, break them up so one is in each row of the emailIds column.&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jul 2022 19:33:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14539#M9011</guid>
      <dc:creator>aschiff</dc:creator>
      <dc:date>2022-07-13T19:33:54Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from a multi-layered JSON object</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14540#M9012</link>
      <description>&lt;P&gt;It should with the explode function but the sample data I have did not have a case where there were more than one emailId in that array so you will have to do a group by and see if you see a count greater than 1 in your own larger data set. I did see there were plenty of empty emailIds and empty contact cells in our sample, so it does filter through those.&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jul 2022 20:14:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14540#M9012</guid>
      <dc:creator>Dooley</dc:creator>
      <dc:date>2022-07-13T20:14:15Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from a multi-layered JSON object</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14541#M9013</link>
      <description>&lt;P&gt;That would be group by the _id and check the count greater than 1 right?&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jul 2022 20:24:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14541#M9013</guid>
      <dc:creator>aschiff</dc:creator>
      <dc:date>2022-07-13T20:24:37Z</dc:date>
    </item>
    <item>
      <title>Re: Extracting data from a multi-layered JSON object</title>
      <link>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14542#M9014</link>
      <description>&lt;P&gt;Yup. But you have to make sure you have duplicates in the data to begin with. If you can't find one right away to do your test on, I would do a test where I add a row of synthetic data that has multiple emailIds within that array. Then when you do a SELECT on the parsed table version, you can do a WHERE on that ID number of your synthetic data to see if there are more than one row.&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jul 2022 20:33:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/extracting-data-from-a-multi-layered-json-object/m-p/14542#M9014</guid>
      <dc:creator>Dooley</dc:creator>
      <dc:date>2022-07-13T20:33:41Z</dc:date>
    </item>
  </channel>
</rss>

