When making videos on new features announced, part of the process is researching what the feature is, thinking how best to demo it and then making the demo itself.
But here’s the twist: by definition, I don’t know what these new features do and this often leads to many headaches as I stumble my way to understanding.
Credit wow this meme is old
Figuring out what something is and crucially, what it isn’t, is an important part of the learning process. In my videos, I only have 180 seconds to focus on the affirmative and, occasionally, clear up any glaring confusion.
In this blog series, I’d like to focus on areas that tripped me up personally. Whether this is getting myself stuck in infrastructure dependency hell, failing to grasp compute science fundamentals, or embarrassingly often, looking at the documentation without letting the words penetrate my skull.
Not every demo is like this, so it won’t be a 1:1 ratio of videos:blogs. And to start, I’d like to talk about the feature Variant.
Variant is a new column type for your tables. The concept has been around for two decades now, the oldest reference I could find online for it was a C++ forum post from 1999.
The idea is that it makes ingesting and storing of inconsistent and varying JSON data easier and faster to deal with.
Here’s the finished product: Say goodbye to messy JSON headaches with VARIANT
My confusion of this feature stemmed from the casual language we use to describe JSON data.
Technically, JSON is a file type. Normally some hideous mess of curly brackets spat out of websites and IoT devices. It’s the best thing we have for these systems, but my Pavlovian reaction is dread that my pipeline is about to get 10x more complicated.
When working on these pipelines and the data is saved to a table, I’ve heard it referred to colloquially as a column type.
“What’s in this bronze table?”
“Oh that’s raw JSON and metadata columns”
When thinking about this critically, this is obviously incorrect. When doing DESCRIBE EXTENDED on a table, the word ‘JSON’ is never next to a column name. By the time it’s part of your table, it’s probably a String or combination of Struct and Array.
Here’s where that confusion intensifies.
Variant is a column type that only accepts file types.
The docs have this snippet as an example:
CREATE TABLE store_data AS
SELECT parse_json(
'{
"store":{
"fruit": [
{"weight":8,"type":"apple"},
{"weight":9,"type":"pear"}
–- many many lines removed for brevity
] }'
) as raw
When looking at this snippet, I was under the impression that Strings could be turned into Variants. This is not correct.
If you create a table with these messy curly brackets in a String column, you cannot convert that String to a Variant.
This has big implications for anyone looking to ingest JSON data as the conversion to Variant must be the first step. This means that your Bronze tables are the most likely to contain Variant types.
Once a video goes up in multiple places I see the common questions that come through. Often it’s a scenario I either didn’t consider, or thought too niche to include. But sometimes it’s where I’ve really botched my explanation.
Here were some of the topics that came out of my comments section.
Quick reminder that this asterisk syntax exists to select all JSON files within a particular file location sub directory
select * from json.`s3:/your/directory/here/*.json`
Only one column is created to contain all JSON records, even if you have multiple files or layouts within them. This is unlike a Struct where we have to be consistent and strict on how these are defined.
In my example here we can see that row 5 has null device_id row 8 has text instead of a number for device_id, 9 onwards don’t even reference a device_id.
As you curate your data, it’s less likely that Silver tables will contain variant types as you’ve started extracting all the relevant parts for your end users.
You can start manipulating the Variant column like you would any Struct or Array, but there is slightly different syntax for some functionality. In place of explode() you would use variant_explode(). The errors are not indecipherable, but not helpful in pointing you in the right direction either.
Check out the docs for more information.
Docs - Official Blog - Engineering talk - Spark Github - Delta Github
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.