- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-10-2024 05:23 AM - edited 12-10-2024 05:31 AM
Edit: value of inner key "value" was an array - I have added the square brackets to the example below.
Hello all,
I'm working with Spark SQL API for querying semi-structured data in Databricks. Currently I'm having a hard time understanding how I can navigate a complex column of the type `VARIANT` to extract field values that are nested at multiple levels.
Take the below example (column name "extendedinformation") in a table (table name "orders") that have been saved as a complex type containing nested json like so:
[
{
"name": "CHANNEL",
"value": "[{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]"
}
]
When querying the data using Spark SQL API on a SQL Warehouse in the Preview Channel (2024.40), I'm unable to find the best way to extract the inner field "name" ($[0].value.name).
If I use variant functions and shortcuts as so:
SELECT extendedinformation:[0].value FROM orders
i get back the contents of $[0].value, and as far as the front-end editor in Databricks shows, this column is a complex type as this (whether this is the true underlying truth is unknown to me, but this i what is shown in the online editor):
[{"id":"DUMMYID1","name":"DUMMYCHANNEL1","role":"DUMMYROLE1"}]
Note that it seems like the contents have been parsed automatically and correctly.
Now I had hoped I could do the following:
SELECT extendedinformation:[0].value.name FROM orders
or
SELECT extendedinformation:[0].value:name FROM orders
But all I get back are `NULL` values.
There might be way to get it to work wrapping `GET_JSON_OBJECT()`, but this is very finicky, has resulted in errors in terms of parsing without changing modes (`FAILFAST` -> `PERMISSIVE`) and, as far as I can understand, might very well incure a performance penalty because of conversions between complex and strings (and then parsing).
How should one approach this using complex functionality in Spark SQL in Databricks?
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-11-2024 02:12 AM - edited 12-11-2024 02:27 AM
Hi @QueryingQuagga ,
So your problem comes from the fact, that you have nested json string as a value. That's why you get null when you're trying to do following:
SELECT extendedinformation:[0].value.name
So what you can do is to explode elements in extendedinformation array and then you can use from_json to parse string contained in value key:
"value": "[{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]"
Below is a full example:
%sql
WITH src AS (
SELECT
parse_json('{
"extendedinformation": [
{
"name": "CHANNEL",
"value": "[{\\"id\\":\\"DUMMYID1\\",\\"name\\":\\"DUMMYCHANNEL1\\",\\"role\\":\\"DUMMYROLE1\\"}]"
},
{
"name": "CHANNEL2",
"value": "[{\\"id\\":\\"DUMMYID2\\",\\"name\\":\\"DUMMYCHANNEL2\\",\\"role\\":\\"DUMMYROLE2\\"}]"
}
]
}') AS data
),
extended AS (
SELECT exp.value as extendedinformation_exploded
FROM src,
LATERAL variant_explode(data:extendedinformation) exp
)
SELECT *, from_json(extendedinformation_exploded:value::STRING, 'ARRAY<MAP<STRING, STRING>>')[0].name AS value_array
FROM extended
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-10-2024 06:32 AM
Is this of any help: Query JSON strings - Azure Databricks | Microsoft Learn? A companion article if you're actually using VARIANT is Query variant data - Azure Databricks | Microsoft Learn.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-11-2024 12:33 AM
Thank you - I'm looking into a solution using `LATERAL`.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-10-2024 08:33 AM - edited 12-10-2024 08:36 AM
Hi @QueryingQuagga ,
Maybe something like that?:
%sql
WITH src AS (
SELECT
parse_json('{
"extendedinformation":[
{ "name": "CHANNEL", "value": [{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]},
{ "name": "CHANNEL2", "value": [{\"id\":\"DUMMYID2\",\"name\":\"DUMMYCHANNEL2\",\"role\":\"DUMMYROLE2\"}]}
]
}') as data
)
SELECT data:extendedinformation:[0].value[0].name
FROM src
And here's an outcome:
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-11-2024 12:42 AM
I'm unsure if the schema that you are parsing above is the correct one though (I might be misunderstanding). If I edit your json to the following:
{
"extendedinformation":[
{
"name": "CHANNEL",
"value": "[{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]"
}
]
}
Then I can't even get the `parse_json()` function to parse (although the above is valid json).
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-11-2024 02:12 AM - edited 12-11-2024 02:27 AM
Hi @QueryingQuagga ,
So your problem comes from the fact, that you have nested json string as a value. That's why you get null when you're trying to do following:
SELECT extendedinformation:[0].value.name
So what you can do is to explode elements in extendedinformation array and then you can use from_json to parse string contained in value key:
"value": "[{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]"
Below is a full example:
%sql
WITH src AS (
SELECT
parse_json('{
"extendedinformation": [
{
"name": "CHANNEL",
"value": "[{\\"id\\":\\"DUMMYID1\\",\\"name\\":\\"DUMMYCHANNEL1\\",\\"role\\":\\"DUMMYROLE1\\"}]"
},
{
"name": "CHANNEL2",
"value": "[{\\"id\\":\\"DUMMYID2\\",\\"name\\":\\"DUMMYCHANNEL2\\",\\"role\\":\\"DUMMYROLE2\\"}]"
}
]
}') AS data
),
extended AS (
SELECT exp.value as extendedinformation_exploded
FROM src,
LATERAL variant_explode(data:extendedinformation) exp
)
SELECT *, from_json(extendedinformation_exploded:value::STRING, 'ARRAY<MAP<STRING, STRING>>')[0].name AS value_array
FROM extended
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
01-02-2025 12:51 AM
So your problem comes from the fact, that you have nested json string as a value.
This is of course the case and I see this now. I think I might have been confused by not knowing enough by VARIANT and the fact that the online DBR SQL editor will do some helpful semi-transparent output formatting.
Thank you for your help - I can now do a lateral explode and recombine to original granularity. If not performant enough I will go upstream and see if something can be done closer to the source schema.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
01-02-2025 03:01 AM
Hi @QueryingQuagga ,
Great that you managed to solve the problem. Learning new features can sometimes be really confusing 🙂

