Working with semi-structured data (complex - variant)

QueryingQuagga
New Contributor III

Edit: value of inner key "value" was an array - I have added the square brackets to the example below.

Hello all,

I'm working with Spark SQL API for querying semi-structured data in Databricks. Currently I'm having a hard time understanding how I can navigate a complex column of the type `VARIANT` to extract field values that are nested at multiple levels.

Take the below example (column name "extendedinformation") in a table (table name "orders") that have been saved as a complex type containing nested json like so:

 

[
  {
    "name": "CHANNEL",
    "value": "[{\"id\":\"DUMMYID1\",\"name\":\"DUMMYCHANNEL1\",\"role\":\"DUMMYROLE1\"}]"
  }
]

 

When querying the data using Spark SQL API on a SQL Warehouse in the Preview Channel (2024.40), I'm unable to find the best way to extract the inner field "name" ($[0].value.name).

If I use variant functions and shortcuts as so:

 

SELECT extendedinformation:[0].value FROM orders

 

i get back the contents of $[0].value, and as far as the front-end editor in Databricks shows, this column is a complex type as this (whether this is the true underlying truth is unknown to me, but this i what is shown in the online editor):

 

[{"id":"DUMMYID1","name":"DUMMYCHANNEL1","role":"DUMMYROLE1"}]

 

Note that it seems like the contents have been parsed automatically and correctly.

Now I had hoped I could do the following:

 

SELECT extendedinformation:[0].value.name FROM orders

 

or

 

SELECT extendedinformation:[0].value:name FROM orders

 

But all I get back are `NULL` values.

There might be way to get it to work wrapping `GET_JSON_OBJECT()`, but this is very finicky, has resulted in errors in terms of parsing without changing modes (`FAILFAST` -> `PERMISSIVE`) and, as far as I can understand, might very well incure a performance penalty because of conversions between complex and strings (and then parsing).

How should one approach this using complex functionality in Spark SQL in Databricks?