<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Working with semi-structured data (complex - variant) in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/working-with-semi-structured-data-complex-variant/m-p/101712#M40782</link>
    <description>&lt;P&gt;I'm unsure if the schema that you are parsing above is the correct one though (I might be misunderstanding). If I edit your json to the following:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;{
  "extendedinformation":[
      {
          "name": "CHANNEL",
          "value": "[{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]"
      }
  ]
}&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Then I can't even get the `parse_json()` function to parse (although the above is valid json).&lt;/P&gt;</description>
    <pubDate>Wed, 11 Dec 2024 08:42:35 GMT</pubDate>
    <dc:creator>QueryingQuagga</dc:creator>
    <dc:date>2024-12-11T08:42:35Z</dc:date>
    <item>
      <title>Working with semi-structured data (complex - variant)</title>
      <link>https://community.databricks.com/t5/data-engineering/working-with-semi-structured-data-complex-variant/m-p/101606#M40745</link>
      <description>&lt;P&gt;Edit: value of inner key "value" was an array - I have added the square brackets to the example below.&lt;/P&gt;&lt;P&gt;Hello all,&lt;/P&gt;&lt;P&gt;I'm working with Spark SQL API for querying semi-structured data in Databricks. Currently I'm having a hard time understanding how I can navigate a complex column of the type `VARIANT` to extract field values that are nested at multiple levels.&lt;/P&gt;&lt;P&gt;Take the below example (column name "extendedinformation") in a table (table name "orders") that have been saved as a complex type containing nested json like so:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;[
  {
    "name": "CHANNEL",
    "value": "[{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]"
  }
]&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;When querying the data using Spark SQL API on a SQL Warehouse in the Preview Channel (&lt;SPAN&gt;2024.40),&amp;nbsp;&lt;/SPAN&gt;I'm unable to find the best way to extract the inner field "name" ($[0].value.name).&lt;/P&gt;&lt;P&gt;If I use variant functions and shortcuts as so:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;SELECT extendedinformation:[0].value FROM orders&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;i get back the contents of $[0].value, and as far as the front-end editor in Databricks shows, this column is a complex type as this (whether this is the true underlying truth is unknown to me, but this i what is shown in the online editor):&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;[{"id":"DUMMYID1","name":"DUMMYCHANNEL1","role":"DUMMYROLE1"}]&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Note that it seems like the contents have been parsed automatically and correctly.&lt;/P&gt;&lt;P&gt;Now I had hoped I could do the following:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;SELECT extendedinformation:[0].value.name FROM orders&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;or&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;SELECT extendedinformation:[0].value:name FROM orders&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;But all I get back are `NULL` values.&lt;/P&gt;&lt;P&gt;There might be way to get it to work wrapping `GET_JSON_OBJECT()`, but this is very finicky, has resulted in errors in terms of parsing without changing modes (`FAILFAST` -&amp;gt; `PERMISSIVE`) and, as far as I can understand, might very well incure a performance penalty because of conversions between complex and strings (and then parsing).&lt;/P&gt;&lt;P&gt;How should one approach this using complex functionality in Spark SQL in Databricks?&lt;/P&gt;</description>
      <pubDate>Tue, 10 Dec 2024 13:31:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/working-with-semi-structured-data-complex-variant/m-p/101606#M40745</guid>
      <dc:creator>QueryingQuagga</dc:creator>
      <dc:date>2024-12-10T13:31:33Z</dc:date>
    </item>
    <item>
      <title>Re: Working with semi-structured data (complex - variant)</title>
      <link>https://community.databricks.com/t5/data-engineering/working-with-semi-structured-data-complex-variant/m-p/101614#M40751</link>
      <description>&lt;P&gt;Is this of any help:&amp;nbsp;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/semi-structured/json" target="_blank"&gt;Query JSON strings - Azure Databricks | Microsoft Learn&lt;/A&gt;?&amp;nbsp; A companion article if you're actually using VARIANT is&amp;nbsp;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/semi-structured/variant" target="_blank"&gt;Query variant data - Azure Databricks | Microsoft Learn&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Tue, 10 Dec 2024 14:32:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/working-with-semi-structured-data-complex-variant/m-p/101614#M40751</guid>
      <dc:creator>Rjdudley</dc:creator>
      <dc:date>2024-12-10T14:32:21Z</dc:date>
    </item>
    <item>
      <title>Re: Working with semi-structured data (complex - variant)</title>
      <link>https://community.databricks.com/t5/data-engineering/working-with-semi-structured-data-complex-variant/m-p/101634#M40753</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/135083"&gt;@QueryingQuagga&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;Maybe something like that?:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;%sql
WITH src AS (
    SELECT 
      parse_json('{
        "extendedinformation":[
            { "name": "CHANNEL", "value": [{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]},
            { "name": "CHANNEL2", "value": [{\"id\":\"DUMMYID2\",\"name\":\"DUMMYCHANNEL2\",\"role\":\"DUMMYROLE2\"}]}
        ]
      }') as data
)
SELECT data:extendedinformation:[0].value[0].name
FROM src&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;And here's an outcome:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="szymon_dybczak_0-1733848580649.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/13432i1461410941E58CCA/image-size/medium?v=v2&amp;amp;px=400" role="button" title="szymon_dybczak_0-1733848580649.png" alt="szymon_dybczak_0-1733848580649.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 10 Dec 2024 16:36:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/working-with-semi-structured-data-complex-variant/m-p/101634#M40753</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-12-10T16:36:27Z</dc:date>
    </item>
    <item>
      <title>Re: Working with semi-structured data (complex - variant)</title>
      <link>https://community.databricks.com/t5/data-engineering/working-with-semi-structured-data-complex-variant/m-p/101709#M40780</link>
      <description>&lt;P&gt;Thank you - I'm looking into a solution using `LATERAL`.&lt;/P&gt;</description>
      <pubDate>Wed, 11 Dec 2024 08:33:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/working-with-semi-structured-data-complex-variant/m-p/101709#M40780</guid>
      <dc:creator>QueryingQuagga</dc:creator>
      <dc:date>2024-12-11T08:33:02Z</dc:date>
    </item>
    <item>
      <title>Re: Working with semi-structured data (complex - variant)</title>
      <link>https://community.databricks.com/t5/data-engineering/working-with-semi-structured-data-complex-variant/m-p/101712#M40782</link>
      <description>&lt;P&gt;I'm unsure if the schema that you are parsing above is the correct one though (I might be misunderstanding). If I edit your json to the following:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;{
  "extendedinformation":[
      {
          "name": "CHANNEL",
          "value": "[{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]"
      }
  ]
}&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Then I can't even get the `parse_json()` function to parse (although the above is valid json).&lt;/P&gt;</description>
      <pubDate>Wed, 11 Dec 2024 08:42:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/working-with-semi-structured-data-complex-variant/m-p/101712#M40782</guid>
      <dc:creator>QueryingQuagga</dc:creator>
      <dc:date>2024-12-11T08:42:35Z</dc:date>
    </item>
    <item>
      <title>Re: Working with semi-structured data (complex - variant)</title>
      <link>https://community.databricks.com/t5/data-engineering/working-with-semi-structured-data-complex-variant/m-p/101729#M40792</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/135083"&gt;@QueryingQuagga&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;So your problem comes from the fact, that you have nested json string as a value. That's why you get null when you're trying to do following:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="javascript"&gt;SELECT extendedinformation:[0].value.name&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So what you can do is to explode elements in extendedinformation array and then you can use from_json to parse string contained in value key:&lt;/P&gt;&lt;PRE&gt;"value": "[{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]"&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Below is a full example:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;%sql

WITH src AS (
  SELECT 
    parse_json('{
        "extendedinformation": [
            {
                "name": "CHANNEL",
                "value": "[{\\"id\\":\\"DUMMYID1\\",\\"name\\":\\"DUMMYCHANNEL1\\",\\"role\\":\\"DUMMYROLE1\\"}]"
            },
            {
                "name": "CHANNEL2",
                "value": "[{\\"id\\":\\"DUMMYID2\\",\\"name\\":\\"DUMMYCHANNEL2\\",\\"role\\":\\"DUMMYROLE2\\"}]"
            }
        ]
    }') AS data
),
extended AS (
  SELECT exp.value as extendedinformation_exploded   
  FROM src,
  LATERAL variant_explode(data:extendedinformation) exp
)
SELECT *, from_json(extendedinformation_exploded:value::STRING, 'ARRAY&amp;lt;MAP&amp;lt;STRING, STRING&amp;gt;&amp;gt;')[0].name AS value_array 
FROM extended&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 11 Dec 2024 10:27:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/working-with-semi-structured-data-complex-variant/m-p/101729#M40792</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2024-12-11T10:27:47Z</dc:date>
    </item>
    <item>
      <title>Re: Working with semi-structured data (complex - variant)</title>
      <link>https://community.databricks.com/t5/data-engineering/working-with-semi-structured-data-complex-variant/m-p/103873#M41581</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;P&gt;So your problem comes from the fact, that you have nested json string as a value.&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;This is of course the case and I see this now. I think I might have been confused by not knowing enough by VARIANT and the fact that the online DBR SQL editor will do some helpful semi-transparent output formatting.&lt;/P&gt;&lt;P&gt;Thank you for your help - I can now do a lateral explode and recombine to original granularity. If not performant enough I will go upstream and see if something can be done closer to the source schema.&lt;/P&gt;</description>
      <pubDate>Thu, 02 Jan 2025 08:51:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/working-with-semi-structured-data-complex-variant/m-p/103873#M41581</guid>
      <dc:creator>QueryingQuagga</dc:creator>
      <dc:date>2025-01-02T08:51:10Z</dc:date>
    </item>
    <item>
      <title>Re: Working with semi-structured data (complex - variant)</title>
      <link>https://community.databricks.com/t5/data-engineering/working-with-semi-structured-data-complex-variant/m-p/103885#M41589</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/135083"&gt;@QueryingQuagga&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;Great that you managed to solve the problem. Learning new features can sometimes be really confusing &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 02 Jan 2025 11:01:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/working-with-semi-structured-data-complex-variant/m-p/103885#M41589</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-01-02T11:01:09Z</dc:date>
    </item>
  </channel>
</rss>

