4 weeks ago - last edited 4 weeks ago
Edit: value of inner key "value" was an array - I have added the square brackets to the example below.
Hello all,
I'm working with Spark SQL API for querying semi-structured data in Databricks. Currently I'm having a hard time understanding how I can navigate a complex column of the type `VARIANT` to extract field values that are nested at multiple levels.
Take the below example (column name "extendedinformation") in a table (table name "orders") that have been saved as a complex type containing nested json like so:
[
{
"name": "CHANNEL",
"value": "[{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]"
}
]
When querying the data using Spark SQL API on a SQL Warehouse in the Preview Channel (2024.40), I'm unable to find the best way to extract the inner field "name" ($[0].value.name).
If I use variant functions and shortcuts as so:
SELECT extendedinformation:[0].value FROM orders
i get back the contents of $[0].value, and as far as the front-end editor in Databricks shows, this column is a complex type as this (whether this is the true underlying truth is unknown to me, but this i what is shown in the online editor):
[{"id":"DUMMYID1","name":"DUMMYCHANNEL1","role":"DUMMYROLE1"}]
Note that it seems like the contents have been parsed automatically and correctly.
Now I had hoped I could do the following:
SELECT extendedinformation:[0].value.name FROM orders
or
SELECT extendedinformation:[0].value:name FROM orders
But all I get back are `NULL` values.
There might be way to get it to work wrapping `GET_JSON_OBJECT()`, but this is very finicky, has resulted in errors in terms of parsing without changing modes (`FAILFAST` -> `PERMISSIVE`) and, as far as I can understand, might very well incure a performance penalty because of conversions between complex and strings (and then parsing).
How should one approach this using complex functionality in Spark SQL in Databricks?
4 weeks ago - last edited 3 weeks ago
Hi @QueryingQuagga ,
So your problem comes from the fact, that you have nested json string as a value. That's why you get null when you're trying to do following:
SELECT extendedinformation:[0].value.name
So what you can do is to explode elements in extendedinformation array and then you can use from_json to parse string contained in value key:
"value": "[{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]"
Below is a full example:
%sql
WITH src AS (
SELECT
parse_json('{
"extendedinformation": [
{
"name": "CHANNEL",
"value": "[{\\"id\\":\\"DUMMYID1\\",\\"name\\":\\"DUMMYCHANNEL1\\",\\"role\\":\\"DUMMYROLE1\\"}]"
},
{
"name": "CHANNEL2",
"value": "[{\\"id\\":\\"DUMMYID2\\",\\"name\\":\\"DUMMYCHANNEL2\\",\\"role\\":\\"DUMMYROLE2\\"}]"
}
]
}') AS data
),
extended AS (
SELECT exp.value as extendedinformation_exploded
FROM src,
LATERAL variant_explode(data:extendedinformation) exp
)
SELECT *, from_json(extendedinformation_exploded:value::STRING, 'ARRAY<MAP<STRING, STRING>>')[0].name AS value_array
FROM extended
4 weeks ago
Is this of any help: Query JSON strings - Azure Databricks | Microsoft Learn? A companion article if you're actually using VARIANT is Query variant data - Azure Databricks | Microsoft Learn.
4 weeks ago
Thank you - I'm looking into a solution using `LATERAL`.
4 weeks ago - last edited 4 weeks ago
Hi @QueryingQuagga ,
Maybe something like that?:
%sql
WITH src AS (
SELECT
parse_json('{
"extendedinformation":[
{ "name": "CHANNEL", "value": [{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]},
{ "name": "CHANNEL2", "value": [{\"id\":\"DUMMYID2\",\"name\":\"DUMMYCHANNEL2\",\"role\":\"DUMMYROLE2\"}]}
]
}') as data
)
SELECT data:extendedinformation:[0].value[0].name
FROM src
And here's an outcome:
4 weeks ago
I'm unsure if the schema that you are parsing above is the correct one though (I might be misunderstanding). If I edit your json to the following:
{
"extendedinformation":[
{
"name": "CHANNEL",
"value": "[{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]"
}
]
}
Then I can't even get the `parse_json()` function to parse (although the above is valid json).
4 weeks ago - last edited 3 weeks ago
Hi @QueryingQuagga ,
So your problem comes from the fact, that you have nested json string as a value. That's why you get null when you're trying to do following:
SELECT extendedinformation:[0].value.name
So what you can do is to explode elements in extendedinformation array and then you can use from_json to parse string contained in value key:
"value": "[{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]"
Below is a full example:
%sql
WITH src AS (
SELECT
parse_json('{
"extendedinformation": [
{
"name": "CHANNEL",
"value": "[{\\"id\\":\\"DUMMYID1\\",\\"name\\":\\"DUMMYCHANNEL1\\",\\"role\\":\\"DUMMYROLE1\\"}]"
},
{
"name": "CHANNEL2",
"value": "[{\\"id\\":\\"DUMMYID2\\",\\"name\\":\\"DUMMYCHANNEL2\\",\\"role\\":\\"DUMMYROLE2\\"}]"
}
]
}') AS data
),
extended AS (
SELECT exp.value as extendedinformation_exploded
FROM src,
LATERAL variant_explode(data:extendedinformation) exp
)
SELECT *, from_json(extendedinformation_exploded:value::STRING, 'ARRAY<MAP<STRING, STRING>>')[0].name AS value_array
FROM extended
Thursday
So your problem comes from the fact, that you have nested json string as a value.
This is of course the case and I see this now. I think I might have been confused by not knowing enough by VARIANT and the fact that the online DBR SQL editor will do some helpful semi-transparent output formatting.
Thank you for your help - I can now do a lateral explode and recombine to original granularity. If not performant enough I will go upstream and see if something can be done closer to the source schema.
Thursday
Hi @QueryingQuagga ,
Great that you managed to solve the problem. Learning new features can sometimes be really confusing 🙂
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group