cancel
Showing results for 
Search instead for 
Did you mean: 
Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
cancel
Showing results for 
Search instead for 
Did you mean: 
SergeRielau
Valued Contributor

star.pngThis is an article about a humble and ancient SQL token: The asterisk, aka "star", aka "*". Not to be confused with Asterix the Gaulle, for him I recommend a decent bookstore. It's also an article about how I'm humbled by the secrets Spark SQL holds which, after 3 1/2 years at Databricks, I'm still discovering. Unless you are a serious Spark committer, chances are some of what you read here will blow your mind as well as it blew mine.

The Lone 

We all know and love * (star).
Star, in its core, is simply a short hand for:
"Insert all columns from all relations in the FROM clause. Do so in the order the relations appear in the syntax, and in the order the columns are appearing in each relation."

SELECT *
FROM VALUES( 1, 2, 3) AS t(a, b, c),
VALUES(10, 20, 30) AS s(x, y, z);
a b c x y z
-- -- -- -- -- --
1 2 3 10 20 30

You also know that star can be pre-fixed with the name of a relation in the FROM clause.
In that case it is a short hand for:
"Insert all columns of this relation in the order they appear."

SELECT *
FROM VALUES( 1, 2, 3) AS t(a, b, c),
VALUES(10, 20, 30) AS s(x, y, z);
x y z
-- -- --
10 20 30

Star does not need to be the sole entry in the SELECT list. You can add additional expressions and multiple stars.

SELECT *, 'Hello' AS hello, t.*, a + x AS ax, s.* 
FROM VALUES( 1, 2, 3) AS t(a, b, c),
VALUES(10, 20, 30) AS s(x, y, z);
a b c x y z hello a b ax x y z
-- -- -- -- -- -- ----- -- -- -- -- -- --
1 2 3 10 20 30 Hello 1 b 11 10 20 30

So far so good. This is the baseline which most products support. Star can also be used with count(*), but that's besides the point.

The EXCEPTional

"All columns" is such an absolutist thing to say. Clearly, if you just want a "few" columns you can just specify them. But you want "most" or "nearly all" columns (like my son who, with surgical precision, works around the food on the plate he does not like to claim he ate nearly all of it and can be dismissed from the table).
For this purpose Databricks introduced the EXCEPT clause. It works as you expect:

SELECT * EXCEPT (b, y)
FROM VALUES( 1, 2, 3) AS t(a, b, c),
VALUES(10, 20, 30) AS s(x, y, z);
a c x z
-- -- -- --
1 3 10 30

EXCEPT does have another trick up its sleeve though. It can, with surgical precision matching that of my son's, carve out fields it distastes from a struct:

SELECT * EXCEPT (b, d.y)
FROM VALUES(1, 2, 3, named_struct('x', 10, 'y', 20, 'z', 30)) AS t(a, b, c, d);
a c d
-- -- -------------
1 3 {x: 10 y: 30}

This much you may know as a user of Databricks who reads the SQL Reference (Hey, I can dream, can I not?) or at least follows the release notes.

But now we're going to enter previously uncharted territory. Uncharted because at least I did not even know that I did not know.

STRUCT Episode 1

At the beginning of this blog I noted that you can limit the star to a specific table in the FROM column by prefixing it with a relation name.
What if I tell you that you can also prefix it with a column or (hang on to your seat!) a field name?

SELECT * EXCEPT (d), d.* EXCEPT(y)
FROM VALUES( 1, 2, 3, named_struct('x', 10, 'y', 20, 'z', 30)) AS t(a, b, c, d);
a b c x z
-- -- -- -- --
1 2 3 10 30

As you can see you can very easily flatten a struct. And it is fully composable with EXCEPT and regular star usage.

Dare we dream? Here it comes...

STRUCT Episode 2

You can actually nest star inside of the struct function. Databricks will expand all columns as arguments to the struct function and struct will pick up the column names as field names:

SELECT a, struct(* EXCEPT (a)) AS struct
FROM VALUES(1, 2, 3) AS t(a, b, c);
a struct
-- ------------
1 {b: 2, c: 3}

Now that's what I call cool! But is it the best star has to offer? Not by a long shot.

ARRAYsing

If you think turning columns into struct with a few keystrokes is nice, how about turning column values into an array?

SELECT a, array(* EXCEPT (a)) AS array
FROM VALUES(1, 2, 3) AS t(a, b, c);
a array
-- ------
1 [2, 3]

So, are ARRAy and STRUCT special in their support for star, just like COUNT(*)?

The GREATEST and many more

It turns out you can use star on any function. The function will simply accept the columns expanded by star as as arguments and do what it always does. Naturally the most interesting applications are those of functions with variable number of arguments, such as:

  • GREATEST
  • LEAST
  • COALESCE
  • CONCAT
  • CONCAT_WS

Here we turn all columns into a comma separated string:

SELECT concat_ws(', ', *) AS result
FROM VALUES(1, 2, 3) AS t(a, b, c);
result
-------
1, 2, 3

It doesn't matter where the star is, as long as it is within the SELECT list.

MAP?

While you can use * on MAP, it is a function after all, MAP takes key value pairs, unlike STRUCT it does not derive keys from the column names.
So you can do:

SELECT map(*) AS result
FROM VALUES('a', 1, 'b', 2, 'c', 3) AS t(key1, val1, key2, val2, key3, val3);
result
------------------
{a->1, b->2, c->3}

But this isn't really that useful.

One option I see in the future is: 

 map ( structExpr )

which turns a struct into map.

CoLATERAL

At the very beginning I state that star covers all the columns in all the relations in the FROM clause.
This is true for the unqualified star:

SELECT x.*
FROM VALUES(1, 2) AS t(a, b),
LATERAL(SELECT *
FROM VALUES(10, 20) AS s(x, y)) AS x;
x y
-- --
10 20

But we can actually explicitly refer to a LATERAL relation using star:

SELECT x.*
FROM VALUES(1, 2) AS t(a, b),
LATERAL(SELECT t.*
FROM VALUES(10, 20) AS s(x, y)) AS x;
a b
-- --
1 2

That's interesting in the abstract, but it opens the door for future exploitation.

To the and Beyond

There are more possible futures for the usage of star:

  • Some vendors, including Spark, have started experimenting with regular expression matching.
  • There is no reason to limit star to the SELECT clause. E,g. WHERE ? IN (*) could be fun.
  • LATERAL Star on varg style table functions could be interesting.

Conclusion: for

I hope you had fun discovering what lies beyond the simple Star with me.
It makes me wonder what other easter eggs are hidden in Spark that I have yet to discover.

References:

4 Comments
glater
New Contributor II

Great info!
I'll have to try those at work!

Just a very humble and simple observation: this (great) article would greatly benefit from using a simple example table, instead of A, B, C, 1, 2, 3.
I'm sure I'm not the only average SQL user who can't really follow the examples provided.

SergeRielau
Valued Contributor

@glater Thanks, I have to fix a couple of typos. Perhaps I can improve the examples along the way.
But definitely good, and much appreciated advise moving forward.

zshao
New Contributor II

Great feature!

 

SELECT * EXCEPT (b, d.y)
FROM VALUES(1, 2, 3, named_struct('x', 10, 'y', 20, 'z', 30)) AS t(a, b, c, d);
a c d
-- -- -------------
1 3 {x: 10 y: 30}

d in the output should be "{x: 10 z: 30}".

SergeRielau
Valued Contributor

@zshao Thanks. I'm working on fixing typos, providing a notebook and generally making the examples more useful.

Contributors