cancel
Showing results for 
Search instead for 
Did you mean: 
Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
cancel
Showing results for 
Search instead for 
Did you mean: 
holly
Valued Contributor III

When making videos on new features announced, part of the process is researching what the feature is, thinking how best to demo it and then making the demo itself.

But here’s the twist: by definition, I don’t know what these new features do and this often leads to many headaches as I stumble my way to understanding.

holly_0-1720013601697.png

 Credit wow this meme is old

Figuring out what something is and crucially, what it isn’t, is an important part of the learning process. In my videos, I only have 180 seconds to focus on the affirmative and, occasionally, clear up any glaring confusion. 

In this blog series, I’d like to focus on areas that tripped me up personally. Whether this is getting myself stuck in infrastructure dependency hell, failing to grasp compute science fundamentals, or embarrassingly often, looking at the documentation without letting the words penetrate my skull.

Not every demo is like this, so it won’t be a 1:1 ratio of videos:blogs. And to start, I’d like to talk about the feature Variant.

How not to make a Variant demo

What it is

Variant is a new column type for your tables. The concept has been around for two decades now, the oldest reference I could find online for it was a C++ forum post from 1999. 

The idea is that it makes ingesting and storing of inconsistent and varying JSON data easier and faster to deal with. 

Here’s the finished product: Say goodbye to messy JSON headaches with VARIANT

What it isn't

My confusion of this feature stemmed from the casual language we use to describe JSON data. 

Technically, JSON is a file type. Normally some hideous mess of curly brackets spat out of websites and IoT devices. It’s the best thing we have for these systems, but my Pavlovian reaction is dread that my pipeline is about to get 10x more complicated.

When working on these pipelines and the data is saved to a table, I’ve heard it referred to colloquially as a column type

“What’s in this bronze table?”
“Oh that’s raw JSON and metadata columns”

When thinking about this critically, this is obviously incorrect. When doing DESCRIBE EXTENDED on a table, the word ‘JSON’ is never next to a column name. By the time it’s part of your table, it’s probably a String or combination of Struct and Array.

Here’s where that confusion intensifies.

Variant is a column type that only accepts file types.

The docs have this snippet as an example:

 

CREATE TABLE store_data AS
SELECT parse_json(
  '{
    "store":{
        "fruit": [
          {"weight":8,"type":"apple"},
          {"weight":9,"type":"pear"}
           –- many many lines removed for brevity
        ] }'
) as raw

 

When looking at this snippet, I was under the impression that Strings could be turned into Variants. This is not correct. 

If you create a table with these messy curly brackets in a String column, you cannot convert that String to a Variant. 

This has big implications for anyone looking to ingest JSON data as the conversion to Variant must be the first step. This means that your Bronze tables are the most likely to contain Variant types.

What else is noteworthy

Once a video goes up in multiple places I see the common questions that come through. Often it’s a scenario I either didn’t consider, or thought too niche to include. But sometimes it’s where I’ve really botched my explanation. 

Here were some of the topics that came out of my comments section.

How can you select a mish mash of JSON files?

Quick reminder that this asterisk syntax exists to select all JSON files within a particular file location sub directory

 

select * from json.`s3:/your/directory/here/*.json`

 

What columns are created?

Only one column is created to contain all JSON records, even if you have multiple files or layouts within them. This is unlike a Struct where we have to be consistent and strict on how these are defined. 

holly_1-1720013601661.png

In my example here we can see that row 5 has null device_id row 8 has text instead of a number for device_id, 9 onwards don’t even reference a device_id

What if I want multiple columns?

As you curate your data, it’s less likely that Silver tables will contain variant types as you’ve started extracting all the relevant parts for your end users. 

You can start manipulating the Variant column like you would any Struct or Array, but there is slightly different syntax for some functionality. In place of explode() you would use variant_explode(). The errors are not indecipherable, but not helpful in pointing you in the right direction either. 

Check out the docs for more information.

TL:DR

  • Variant turns a file type into a column type
  • This should be the first step in your Bronze pipeline
  • Each record will go into a single column, regardless of how similar they are to previous records
  • Use the Variant version of SQL functions to build Silver tables 

Further reading

Docs - Official Blog - Engineering talk - Spark Github - Delta Github 

1 Comment
AlexCole
New Contributor III

Good post Holly! Nice to see the information written down as well.

If you are interested in the background to variant data types you'll find they are often called tagged unions (Tagged union - Wikipedia) and have a history in the typed lambda calculus