This is an article about a humble and ancient SQL token: the asterisk, aka "star", aka "*". Not to be confused with Asterix the Gaul; for him I recommend a decent bookstore. It's also an article about how I'm humbled by the secrets Spark SQL holds which, after 3 1/2 years at Databricks, I'm still discovering. Unless you are a serious Spark committer, chances are some of what you read here will blow your mind as it blew mine.
We all know and love * (star).
Star, at its core, is simply shorthand for:
"Insert all columns from all relations in the FROM clause. Do so in the order the relations appear in the syntax, and in the order the columns appear in each relation."
SELECT *
FROM VALUES( 1, 2, 3) AS t(a, b, c),
VALUES(10, 20, 30) AS s(x, y, z);
a b c x y z
-- -- -- -- -- --
1 2 3 10 20 30
You also know that star can be prefixed with the name of a relation in the FROM clause.
In that case it is shorthand for:
"Insert all columns of this relation in the order they appear."
SELECT s.*
FROM VALUES( 1, 2, 3) AS t(a, b, c),
VALUES(10, 20, 30) AS s(x, y, z);
x y z
-- -- --
10 20 30
Star does not need to be the sole entry in the SELECT list. You can add additional expressions and multiple stars.
SELECT *, 'Hello' AS hello, t.*, a + x AS ax, s.*
FROM VALUES( 1, 2, 3) AS t(a, b, c),
VALUES(10, 20, 30) AS s(x, y, z);
a  b  c  x  y  z  hello  a  b  c  ax  x  y  z
-- -- -- -- -- -- -----  -- -- -- -- -- -- --
1  2  3  10 20 30 Hello  1  2  3  11 10 20 30
So far so good. This is the baseline which most products support. Star can also be used in count(*), but that's beside the point.
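For completeness: in count(*) the star counts rows rather than expanding columns, so over the single-row cross join used above it returns one:
SELECT count(*) AS cnt
FROM VALUES( 1, 2, 3) AS t(a, b, c),
     VALUES(10, 20, 30) AS s(x, y, z);
cnt
---
  1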
"All columns" is such an absolutist thing to say. Clearly, if you just want a "few" columns you can just specify them. But you want "most" or "nearly all" columns (like my son who, with surgical precision, works around the food on the plate he does not like to claim he ate nearly all of it and can be dismissed from the table).
For this purpose Databricks introduced the EXCEPT clause. It works as you would expect:
SELECT * EXCEPT (b, y)
FROM VALUES( 1, 2, 3) AS t(a, b, c),
VALUES(10, 20, 30) AS s(x, y, z);
a c x z
-- -- -- --
1 3 10 30
EXCEPT does have another trick up its sleeve, though. It can, with surgical precision matching my son's, carve out the fields you dislike from a struct:
SELECT * EXCEPT (b, d.y)
FROM VALUES(1, 2, 3, named_struct('x', 10, 'y', 20, 'z', 30)) AS t(a, b, c, d);
a c d
-- -- -------------
1 3 {x: 10, z: 30}
This much you may know as a user of Databricks who reads the SQL Reference (hey, I can dream, can't I?) or at least follows the release notes.
But now we're going to enter previously uncharted territory. Uncharted because, at least in my case, I did not even know what I did not know.
At the beginning of this blog I noted that you can limit the star to a specific relation in the FROM clause by prefixing it with the relation name.
What if I tell you that you can also prefix it with a column or (hang on to your seat!) a field name?
SELECT * EXCEPT (d), d.* EXCEPT(y)
FROM VALUES( 1, 2, 3, named_struct('x', 10, 'y', 20, 'z', 30)) AS t(a, b, c, d);
a b c x z
-- -- -- -- --
1 2 3 10 30
As you can see, you can very easily flatten a struct, and it is fully composable with EXCEPT and regular star usage.
Dare we dream? Here it comes...
You can actually nest star inside the struct function. Databricks will expand all columns as arguments to struct, and struct will pick up the column names as field names:
SELECT a, struct(* EXCEPT (a)) AS struct
FROM VALUES(1, 2, 3) AS t(a, b, c);
a struct
-- ------------
1 {b: 2, c: 3}
Now that's what I call cool! But is it the best star has to offer? Not by a long shot.
If you think turning columns into struct with a few keystrokes is nice, how about turning column values into an array?
SELECT a, array(* EXCEPT (a)) AS array
FROM VALUES(1, 2, 3) AS t(a, b, c);
a array
-- ------
1 [2, 3]
So, are ARRAY and STRUCT special in their support for star, just like COUNT(*)?
It turns out you can use star with any function. The function simply accepts the columns expanded by the star as arguments and does what it always does. Naturally, the most interesting applications involve functions that take a variable number of arguments, such as concat_ws.
Here we turn all columns into a comma-separated string:
SELECT concat_ws(', ', *) AS result
FROM VALUES(1, 2, 3) AS t(a, b, c);
result
-------
1, 2, 3
It doesn't matter where the star is, as long as it is within the SELECT list.
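If I read that rule right, the star should even be able to sit between ordinary arguments of a function (I would expect the following, untested variation on the example above, to produce start, 1, 2, 3, end):
SELECT concat_ws(', ', 'start', *, 'end') AS result
FROM VALUES(1, 2, 3) AS t(a, b, c);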
You can use * with MAP as well; it is a function, after all. But MAP takes key-value pairs, and unlike STRUCT it does not derive keys from the column names.
So you can do:
SELECT map(*) AS result
FROM VALUES('a', 1, 'b', 2, 'c', 3) AS t(key1, val1, key2, val2, key3, val3);
result
------------------
{a->1, b->2, c->3}
But this isn't really that useful.
One option I see in the future is:
map ( structExpr )
which turns a struct into a map.
At the very beginning I stated that star covers all the columns in all the relations in the FROM clause.
For an unqualified star inside a LATERAL subquery, that means the subquery's own FROM clause:
SELECT x.*
FROM VALUES(1, 2) AS t(a, b),
LATERAL(SELECT *
FROM VALUES(10, 20) AS s(x, y)) AS x;
x y
-- --
10 20
But from within the LATERAL subquery we can actually refer to an outer relation using a qualified star:
SELECT x.*
FROM VALUES(1, 2) AS t(a, b),
LATERAL(SELECT t.*
FROM VALUES(10, 20) AS s(x, y)) AS x;
a b
-- --
1 2
That's interesting in the abstract, but it opens the door for future extensions.
There are more possible futures for the usage of star.
I hope you had fun discovering what lies beyond the simple Star with me.
It makes me wonder what other easter eggs are hidden in Spark that I have yet to discover.