cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Working with semi-structured data (complex - variant)

QueryingQuagga
New Contributor III

Edit: value of inner key "value" was an array - I have added the square brackets to the example below.

Hello all,

I'm working with Spark SQL API for querying semi-structured data in Databricks. Currently I'm having a hard time understanding how I can navigate a complex column of the type `VARIANT` to extract field values that are nested at multiple levels.

Take the below example (column name "extendedinformation") in a table (table name "orders") that have been saved as a complex type containing nested json like so:

 

[
  {
    "name": "CHANNEL",
    "value": "[{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]"
  }
]

 

When querying the data using Spark SQL API on a SQL Warehouse in the Preview Channel (2024.40), I'm unable to find the best way to extract the inner field "name" ($[0].value.name).

If I use variant functions and shortcuts as so:

 

SELECT extendedinformation:[0].value FROM orders

 

i get back the contents of $[0].value, and as far as the front-end editor in Databricks shows, this column is a complex type as this (whether this is the true underlying truth is unknown to me, but this i what is shown in the online editor):

 

[{"id":"DUMMYID1","name":"DUMMYCHANNEL1","role":"DUMMYROLE1"}]

 

Note that it seems like the contents have been parsed automatically and correctly.

Now I had hoped I could do the following:

 

SELECT extendedinformation:[0].value.name FROM orders

 

or

 

SELECT extendedinformation:[0].value:name FROM orders

 

But all I get back are `NULL` values.

There might be way to get it to work wrapping `GET_JSON_OBJECT()`, but this is very finicky, has resulted in errors in terms of parsing without changing modes (`FAILFAST` -> `PERMISSIVE`) and, as far as I can understand, might very well incure a performance penalty because of conversions between complex and strings (and then parsing).

How should one approach this using complex functionality in Spark SQL in Databricks?

1 ACCEPTED SOLUTION

Accepted Solutions

Hi @QueryingQuagga ,

So your problem comes from the fact, that you have nested json string as a value. That's why you get null when you're trying to do following:

 

 

 

SELECT extendedinformation:[0].value.name

 

 

 

So what you can do is to explode elements in extendedinformation array and then you can use from_json to parse string contained in value key:

"value": "[{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]"

 

Below is a full example:

 

 

%sql

WITH src AS (
  SELECT 
    parse_json('{
        "extendedinformation": [
            {
                "name": "CHANNEL",
                "value": "[{\\"id\\":\\"DUMMYID1\\",\\"name\\":\\"DUMMYCHANNEL1\\",\\"role\\":\\"DUMMYROLE1\\"}]"
            },
            {
                "name": "CHANNEL2",
                "value": "[{\\"id\\":\\"DUMMYID2\\",\\"name\\":\\"DUMMYCHANNEL2\\",\\"role\\":\\"DUMMYROLE2\\"}]"
            }
        ]
    }') AS data
),
extended AS (
  SELECT exp.value as extendedinformation_exploded   
  FROM src,
  LATERAL variant_explode(data:extendedinformation) exp
)
SELECT *, from_json(extendedinformation_exploded:value::STRING, 'ARRAY<MAP<STRING, STRING>>')[0].name AS value_array 
FROM extended

 

 

 

 

View solution in original post

7 REPLIES 7

Rjdudley
Contributor II

Is this of any help: Query JSON strings - Azure Databricks | Microsoft Learn?  A companion article if you're actually using VARIANT is Query variant data - Azure Databricks | Microsoft Learn.

Thank you - I'm looking into a solution using `LATERAL`.

szymon_dybczak
Esteemed Contributor III

Hi @QueryingQuagga ,

Maybe something like that?:

 

%sql
WITH src AS (
    SELECT 
      parse_json('{
        "extendedinformation":[
            { "name": "CHANNEL", "value": [{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]},
            { "name": "CHANNEL2", "value": [{\"id\":\"DUMMYID2\",\"name\":\"DUMMYCHANNEL2\",\"role\":\"DUMMYROLE2\"}]}
        ]
      }') as data
)
SELECT data:extendedinformation:[0].value[0].name
FROM src

 

And here's an outcome:

szymon_dybczak_0-1733848580649.png

 

I'm unsure if the schema that you are parsing above is the correct one though (I might be misunderstanding). If I edit your json to the following:

 

{
  "extendedinformation":[
      {
          "name": "CHANNEL",
          "value": "[{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]"
      }
  ]
}

 

Then I can't even get the `parse_json()` function to parse (although the above is valid json).

Hi @QueryingQuagga ,

So your problem comes from the fact, that you have nested json string as a value. That's why you get null when you're trying to do following:

 

 

 

SELECT extendedinformation:[0].value.name

 

 

 

So what you can do is to explode elements in extendedinformation array and then you can use from_json to parse string contained in value key:

"value": "[{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]"

 

Below is a full example:

 

 

%sql

WITH src AS (
  SELECT 
    parse_json('{
        "extendedinformation": [
            {
                "name": "CHANNEL",
                "value": "[{\\"id\\":\\"DUMMYID1\\",\\"name\\":\\"DUMMYCHANNEL1\\",\\"role\\":\\"DUMMYROLE1\\"}]"
            },
            {
                "name": "CHANNEL2",
                "value": "[{\\"id\\":\\"DUMMYID2\\",\\"name\\":\\"DUMMYCHANNEL2\\",\\"role\\":\\"DUMMYROLE2\\"}]"
            }
        ]
    }') AS data
),
extended AS (
  SELECT exp.value as extendedinformation_exploded   
  FROM src,
  LATERAL variant_explode(data:extendedinformation) exp
)
SELECT *, from_json(extendedinformation_exploded:value::STRING, 'ARRAY<MAP<STRING, STRING>>')[0].name AS value_array 
FROM extended

 

 

 

 


So your problem comes from the fact, that you have nested json string as a value.


This is of course the case and I see this now. I think I might have been confused by not knowing enough by VARIANT and the fact that the online DBR SQL editor will do some helpful semi-transparent output formatting.

Thank you for your help - I can now do a lateral explode and recombine to original granularity. If not performant enough I will go upstream and see if something can be done closer to the source schema.

Hi @QueryingQuagga ,

Great that you managed to solve the problem. Learning new features can sometimes be really confusing 🙂

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group