cancel
Showing results for 
Search instead for 
Did you mean: 
Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
cancel
Showing results for 
Search instead for 
Did you mean: 
SergeRielau
Valued Contributor
Valued Contributor

Inuktitut, the language of the Inuit, has 50 words for snow and ice.

That’s - as they say - fake news, but the point made is metaphorical:

When something is important to a people, their language finds a way to become efficient on that topic.

SQL is all about transforming and filtering data. So it should come as no surprise that SQL has sprouted a whole lot more ways to eliminate datapoints than the humble WHERE. In this post we broaden our palate to speak SQL a bit more … refined.

filterserge.png

Warm Up

Note: You can find all examples to run here.

The obvious and most SQL way to filter rows from a table is the WHERE clause.

DROP TABLE IF EXISTS products;
CREATE
TABLE products(name STRING, category STRING, price INTEGER, units INTEGER, description STRING);
INSERT INTO products VALUES
('colander' , 'housewares' , 10, 1, 'Bowl with large holes to drain water from past and vegetables'),
('coffee filter' , 'house supplies' , 4, 20, 'Holds ground coffee' ),
('skimmer' , 'pool supplies' , 100, 1, 'Drain filtering leaves from the water surface' ),
('diatomaceous earth' , 'pool supplies' , 30, 0, 'Coats pool filter to filter impurities from water' ),
('strainer' , 'house wares' , 15, 1, 'Bowl with a handle and fine holes' ),
('tea infuser' , 'house wares' , 5, 1, 'Holds loose tea leaves for infusion' ),
('electrostatic air filter', 'HVAC hardware' , 10, 1, 'Air filter for HVAC systems' );

SELECT * FROM products WHERE category = 'pool supplies';

name
skimmer
diatomaceous

But even WHERE has some properties that can trip some users up. First, the WHERE clause applies before the SELECT list items are executed. That has consequences:

  • The WHERE clause protects expressions in the SELECT list from runtime errors, such as division by zero.
SELECT name, price / units AS unitprice
FROM products
WHERE category = 'pool supplies' AND units != 0;
name unitprice
skimmer 100
  • The WHERE clause cannot reference the results (i.e. the column aliases) of the SELECT list.
    Where “lateral column alias” within the select is handy, it is outright wrong when applied to the WHERE clause.
SELECT name, price / nullif(units, 0) AS unitprice
FROM products
WHERE unitprice IS NULL;
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `unitprice` cannot be resolved.
Did you mean one of the following? [`price`, `units`, `name`, `category`, `description`]. SQLSTATE: 42703

Second, the WHERE clause is not the first filter to be applied.The first filter to be applied is the ON clause of any joins. In many cases this doesn’t matter ...

SELECT name, price * number AS total
FROM products
NATURAL JOIN VALUES('skimmer' , 1),
('colander', 2) AS orders(name, number)
WHERE category = 'housewares';
name total
colander 20

... until it does. Most systems, including Databricks, are rather cavalier on that matter.

SELECT name, price * number AS total
FROM products
NATURAL JOIN VALUES('skimmer' , 1),
('colander', 2) AS orders(name, number)
WHERE price / units > 1;
[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== SQL (line 5, position 7) ==
WHERE price / units > 1
^^^^^^^^^^^^^


Third, there is no promise of order of execution within the WHERE clause itself. There isn’t even a promise that every expression within a boolean expression is in fact executed!

 

HAVING

As far as row filtering is concerned the WHERE clause is really all SQL needs. Thanks to nesting of subqueries we can easily answer a question like:

Give me all categories with more than one product.

SELECT *
FROM (SELECT category, count(*) AS num_items
FROM products
GROUP BY ALL)
WHERE num_items > 1;
category num_items
housewares 2
pool supplies 2

 This pattern is so frequent, that SQL sprouted its first extension in order to make this query more succinct:

SELECT category, count(*) AS num_items
FROM products
GROUP BY ALL
HAVING num_items > 1;
category num_items
housewares 2
pool supplies 2

The HAVING clause is nothing more than syntactic sugar for:

  • pushing a subquery
  • selecting all of the subquery's columns, and
  • adding a new WHERE clause.

As far as bedrock ANSI SQL 92 and row filtering is concerned: ON, WHERE, and HAVING are pretty much it. There is of course LIMIT (called FETCH FIRST n ROWS in ANSI)... hold that thought.

A more interesting area to look at next is filtering on column values, or more precisely aggregates.

 

Aggregate level filtering

By default, aggregate functions operate on all rows within a group. For example, we can count products per category.

SELECT category, count(*) AS num_items
FROM products GROUP BY ALL;
category num_items
HVAC hardware 1
housewares 4
pool supplies 2

If we want to differentiate between consumables (items we need to keep buying) and non-consumables we need to add that column to the GROUP BY:

SELECT category, consumable, count(*) AS num_items
FROM products GROUP BY ALL;
category consumable num_items
pool supplies false 1
housewares false 3
pool supplies true 1
HVAC hardware false 1
housewares true 1

We get two rows per category: One for consumables, one for non-consumables. However, this may not be what we wanted. We may want one line per category and separate columns for consumable and non-consumable.

We can pivot the table, but that’s heavy handed. An easier way is to control the aggregation itself:

SELECT category,
sum(CASE WHEN consumable THEN 1 ELSE 0 END) AS consumable,
sum(CASE WHEN !consumable THEN 1 ELSE 0 END) AS non_consumable
FROM products
GROUP BY ALL;
category consumable non_consumable
HVAC hardware 0 1
housewares 1 3
pool supplies 1 1

There are any number of variations other than CASE.

This pattern of conditional counting is so frequently used that a dedicated aggregation function is available to do this with less keystrokes:

SELECT category,
count_if(consumable) AS consumable,
count_if(!consumable) AS non_consumable
FROM products
GROUP BY ALL;
category consumable non_consumable
HVAC hardware 0 1
housewares 1 3
pool supplies 1 1

Great! But we left all those other aggregations out in the cold. What if we wanted to also aggregate the price of these items?

SELECT category,
count_if(consumable) AS consumable,
sum(CASE WHEN consumable THEN price ELSE 0 END) AS consumable_price,
count_if(!consumable) AS non_consumable,
sum(CASE WHEN !consumable THEN price ELSE 0 END) AS non_consumable_price
FROM products
GROUP BY ALL;
category consumable consumable_price non_consumable non_consumable_price
HVAC hardware 0 0 1 10
housewares 1 4 3 30
pool supplies 1 30 1 100

The language could have spawned sum_if(), avg_if() etc… just like Databricks added try_* functions.
Do we really want to introduce: try_sum_if()?

It seems any aggregate function could potentially need some sort of postfix that controls whether a value within the group should or should not contribute to the aggregation. There are two choices to here beyond exploding the number of functions:

  1. Add an optional boolean parameter to each aggregate function
  2. Add a SQL syntax clause

Option 2 has won out and FILTER was born.

SELECT category,
count(1) FILTER(WHERE consumable) AS consumable,
sum(price) FILTER(WHERE consumable) AS consumable_price,
count(1) FILTER(WHERE !consumable) AS noon_consumable,
sum(price) FILTER(WHERE !consumable) AS non_consumable_price
FROM products
GROUP BY ALL;
category consumable consumable_price noon_consumable non_consumable_price
HVAC hardware 0 1
housewares 1 4 3 4
pool supplies 1 30 1 30

Unlike the CASE expression pattern though, FILTER does not rely on the existence of a neutral input argument. It is truly more expressive.

Note: It's easy to predict that eventually there will be a subquery level FILTER clause, just like the repetitive nature of the window aggregation resulted in the WINDOW clause.

Talking of window clauses, let's return to row filtering.

 

No LIMITs

Earlier we glossed over LIMIT; let’s now have a closer look.

Aside from the obvious

Give me a page's worth of results only, I’m not going to look at the rest anyway!

use case,  there is a common pattern to retrieve to “maximum row".

What is the priciest product?  And give me that entire row!

SELECT *
FROM products
ORDER BY price DESC LIMIT 1;
name category price units consumable description
skimmer pool supplies 100 1 false Drain filtering leaves from the water surface

Alas, things get tricky if we want to return the priciest product by category, and we want more than just the price. The classic solution is use an EXISTS and a self join:

SELECT p.*
FROM products AS p
WHERE EXISTS (SELECT 1
FROM products AS m
WHERE m.category = p.category
HAVING max(m.price) = p.price);
name category price units consumable description
skimmer pool supplies 100 1 false Drain filtering leaves from the water surface
strainer housewares 15 1 false Bowl with a handle and fine holes
electrostatic air filter HVAC hardware 10 1 false Air filter for HVAC systems


Another solution, without the self join, is to use the max_by() aggregate function which returns a value based on another value's max.

SELECT max_by(name, price) AS name,
category,
max(price) AS price,
max_by(units, price) AS units,
max_by(consumable, price) AS consumable,
max_by(description, price) AS description
FROM products
GROUP BY ALL;
name category price units consumable description
electrostatic air filter HVAC hardware 10 1 false Air filter for HVAC systems
strainer housewares 15 1 false Bowl with a handle and fine holes
skimmer pool supplies 100 1 false Drain filtering leaves from the water surface

There is a problem though: price can hardly be assumed to be unique by category. We are walking on thin ice assuming max_by() will not deliver us a chimera of a product mixing different values from products at the same price.

Yet another, safer, way is to employ window aggregation. In this approach we rank each product within its category.
Then we filter out only the winners per category, and remove the rank itself.

SELECT * EXCEPT(rank)
FROM (SELECT row_number() OVER (PARTITION BY category ORDER BY price DESC) AS rank,
*
FROM products)
WHERE rank = 1;
name category price units consumable description
electrostatic air filter HVAC hardware 10 1 false Air filter for HVAC systems
strainer housewares 15 1 false Bowl with a handle and fine holes
skimmer pool supplies 100 1 false Drain filtering leaves from the water surface

Thank goodness for the EXCEPT syntax, which made this manageable. But we had to push a subquery, just like we had to do at the beginning of the post when we "invented" HAVING.

Again, this pattern is popular enough that SQL found a way. It's called the QUALIFY clause and it operates on expressions involving window functions only.

SELECT *
FROM products
QUALIFY row_number() OVER(PARTITION BY category ORDER BY price DESC) = 1;
name category price units consumable description
electrostatic air filter HVAC hardware 10 1 false Air filter for HVAC systems
strainer housewares 15 1 false Bowl with a handle and fine holes
skimmer pool supplies 100 1 false Drain filtering leaves from the water surface

The trick with QUALIFY is not so much that it does allow the window function, but that it executes after the SELECT list, just like HAVING.

We can make another prediction: Sooner rather than later QUALIFY will be generalized to allow any predicate operating on any column in the SELECT list.

Wouldn't that be nice?

 

Conclusion

SQL provides a whole lot more than just WHERE to filter data:

  • HAVING filters on groups
  • FILTER refines aggregation
  • QUALIFY sieves through selected rows.

Related

 

Contributors