Databricks Community

SergeRielau · ‎03-29-2024

Inuktitut, the language of the Inuit, has 50 words for snow and ice.

That’s - as they say - fake news, but the point made is metaphorical:

When something is important to a people, their language finds a way to become efficient on that topic.

SQL is all about transforming and filtering data. So it should come as no surprise that SQL has sprouted a whole lot more ways to eliminate datapoints than the humble WHERE. In this post we broaden our palate to speak SQL a bit more … refined.

Warm Up

Note: You can find all examples to run here.

The obvious and most SQL way to filter rows from a table is the WHERE clause.

DROP TABLE IF EXISTS products;
CREATE TABLE products(name STRING, category STRING, price INTEGER, units INTEGER, description STRING);
INSERT INTO products VALUES
('colander'           , 'housewares'         ,  10,  1, 'Bowl with large holes to drain water from past and vegetables'),
('coffee filter'      , 'house supplies'     ,   4, 20, 'Holds ground coffee'                                          ),
('skimmer'            , 'pool supplies'      , 100,  1, 'Drain filtering leaves from the water surface'                ),
('diatomaceous earth' , 'pool supplies'      ,  30,  0, 'Coats pool filter to filter impurities from water'            ),
('strainer'           , 'house wares'        ,  15,  1, 'Bowl with a handle and fine holes'                            ),
('tea infuser'        , 'house wares'        ,   5,  1, 'Holds loose tea leaves for infusion'                          ),
('electrostatic air filter', 'HVAC hardware' ,  10,  1, 'Air filter for HVAC systems'                                  );

SELECT * FROM products WHERE category = 'pool supplies';

name         
skimmer
diatomaceous

But even WHERE has some properties that can trip some users up. First, the WHERE clause applies before the SELECT list items are executed. That has consequences:

The WHERE clause protects expressions in the SELECT list from runtime errors, such as division by zero.

SELECT name, price / units AS unitprice
  FROM products
  WHERE category = 'pool supplies' AND units != 0;
name    unitprice
skimmer       100

The WHERE clause cannot reference the results (i.e. the column aliases) of the SELECT list.
Where “lateral column alias” within the select is handy, it is outright wrong when applied to the WHERE clause.

SELECT name, price / nullif(units, 0) AS unitprice
  FROM products
  WHERE unitprice IS NULL;
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `unitprice` cannot be resolved.
Did you mean one of the following? [`price`, `units`, `name`, `category`, `description`]. SQLSTATE: 42703

Second, the WHERE clause is not the first filter to be applied.The first filter to be applied is the ON clause of any joins. In many cases this doesn’t matter ...

SELECT name, price * number AS total
  FROM products
  NATURAL JOIN VALUES('skimmer' , 1),
                     ('colander', 2) AS orders(name, number)
  WHERE category = 'housewares';
name     total
colander    20

... until it does. Most systems, including Databricks, are rather cavalier on that matter.

SELECT name, price * number AS total
  FROM products
  NATURAL JOIN VALUES('skimmer' , 1),
                     ('colander', 2) AS orders(name, number)
  WHERE price / units > 1;
[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== SQL (line 5, position 7) ==
WHERE price / units > 1
      ^^^^^^^^^^^^^

Third, there is no promise of order of execution within the WHERE clause itself. There isn’t even a promise that every expression within a boolean expression is in fact executed!

HAVING

As far as row filtering is concerned the WHERE clause is really all SQL needs. Thanks to nesting of subqueries we can easily answer a question like:

Give me all categories with more than one product.

SELECT *
  FROM (SELECT category, count(*) AS num_items
         FROM products
         GROUP BY ALL)
  WHERE num_items > 1;
category      num_items
housewares            2
pool supplies         2

This pattern is so frequent, that SQL sprouted its first extension in order to make this query more succinct:

SELECT category, count(*) AS num_items
  FROM products
  GROUP BY ALL
  HAVING num_items > 1;
category      num_items
housewares            2
pool supplies         2

The HAVING clause is nothing more than syntactic sugar for:

pushing a subquery
selecting all of the subquery's columns, and
adding a new WHERE clause.

As far as bedrock ANSI SQL 92 and row filtering is concerned: ON, WHERE, and HAVING are pretty much it. There is of course LIMIT (called FETCH FIRST n ROWS in ANSI)... hold that thought.

A more interesting area to look at next is filtering on column values, or more precisely aggregates.

Aggregate level filtering

By default, aggregate functions operate on all rows within a group. For example, we can count products per category.

SELECT category, count(*) AS num_items
  FROM products GROUP BY ALL;
category      num_items
HVAC hardware         1
housewares            4
pool supplies         2

If we want to differentiate between consumables (items we need to keep buying) and non-consumables we need to add that column to the GROUP BY:

SELECT category, consumable, count(*) AS num_items
  FROM products GROUP BY ALL;
category      consumable num_items
pool supplies false              1
housewares    false              3
pool supplies true               1
HVAC hardware false              1
housewares    true               1

We get two rows per category: One for consumables, one for non-consumables. However, this may not be what we wanted. We may want one line per category and separate columns for consumable and non-consumable.

We can pivot the table, but that’s heavy handed. An easier way is to control the aggregation itself:

SELECT category,
       sum(CASE WHEN consumable THEN 1 ELSE 0 END)  AS consumable,
       sum(CASE WHEN !consumable THEN 1 ELSE 0 END) AS non_consumable
  FROM products
  GROUP BY ALL;
category      consumable non_consumable
HVAC hardware          0              1
housewares             1              3
pool supplies          1              1

There are any number of variations other than CASE.

This pattern of conditional counting is so frequently used that a dedicated aggregation function is available to do this with less keystrokes:

SELECT category,
       count_if(consumable)  AS consumable,
       count_if(!consumable) AS non_consumable
 FROM products
 GROUP BY ALL;
category      consumable non_consumable
HVAC hardware          0              1
housewares             1              3
pool supplies          1              1

Great! But we left all those other aggregations out in the cold. What if we wanted to also aggregate the price of these items?

SELECT category,
       count_if(consumable) AS consumable,
       sum(CASE WHEN consumable THEN price ELSE 0 END) AS consumable_price,
       count_if(!consumable) AS non_consumable,
       sum(CASE WHEN !consumable THEN price ELSE 0 END) AS non_consumable_price
  FROM products
  GROUP BY ALL;
category      consumable consumable_price non_consumable non_consumable_price
HVAC hardware          0                0              1                   10
housewares             1                4              3                   30
pool supplies          1               30              1                  100

The language could have spawned sum_if(), avg_if() etc… just like Databricks added try_* functions.
Do we really want to introduce: try_sum_if()?

It seems any aggregate function could potentially need some sort of postfix that controls whether a value within the group should or should not contribute to the aggregation. There are two choices to here beyond exploding the number of functions:

Add an optional boolean parameter to each aggregate function
Add a SQL syntax clause

Option 2 has won out and FILTER was born.

SELECT category,
       count(1)   FILTER(WHERE  consumable) AS consumable,
       sum(price) FILTER(WHERE  consumable) AS consumable_price,
       count(1)   FILTER(WHERE !consumable) AS noon_consumable,
       sum(price) FILTER(WHERE !consumable) AS non_consumable_price
  FROM products
  GROUP BY ALL;
category      consumable consumable_price noon_consumable non_consumable_price
HVAC hardware          0                                1 
housewares             1                4               3                    4
pool supplies          1               30               1                   30

Unlike the CASE expression pattern though, FILTER does not rely on the existence of a neutral input argument. It is truly more expressive.

Note: It's easy to predict that eventually there will be a subquery level FILTER clause, just like the repetitive nature of the window aggregation resulted in the WINDOW clause.

Talking of window clauses, let's return to row filtering.

No LIMITs

Earlier we glossed over LIMIT; let’s now have a closer look.

Aside from the obvious

Give me a page's worth of results only, I’m not going to look at the rest anyway!

use case, there is a common pattern to retrieve to “maximum row".

What is the priciest product? And give me that entire row!

SELECT *
  FROM products
  ORDER BY price DESC LIMIT 1;
name    category      price units consumable description                                   
skimmer pool supplies   100     1 false      Drain filtering leaves from the water surface

Alas, things get tricky if we want to return the priciest product by category, and we want more than just the price. The classic solution is use an EXISTS and a self join:

SELECT p.*
  FROM products AS p
  WHERE EXISTS (SELECT 1 
                  FROM products AS m
                 WHERE m.category = p.category
                HAVING max(m.price) = p.price);
name                     category      price units consumable description                                   
skimmer                  pool supplies   100     1 false      Drain filtering leaves from the water surface
strainer                 housewares       15     1 false      Bowl with a handle and fine holes
electrostatic air filter HVAC hardware    10     1 false      Air filter for HVAC systems

Another solution, without the self join, is to use the max_by() aggregate function which returns a value based on another value's max.

SELECT max_by(name, price) AS name,
       category,
       max(price) AS price,
       max_by(units, price) AS units,
       max_by(consumable, price) AS consumable,
       max_by(description, price) AS description
  FROM products
  GROUP BY ALL;
name                     category      price units consumable description                                  
electrostatic air filter HVAC hardware    10     1 false      Air filter for HVAC systems
strainer                 housewares       15     1 false      Bowl with a handle and fine holes
skimmer                  pool supplies   100     1 false      Drain filtering leaves from the water surface

There is a problem though: price can hardly be assumed to be unique by category. We are walking on thin ice assuming max_by() will not deliver us a chimera of a product mixing different values from products at the same price.

Yet another, safer, way is to employ window aggregation. In this approach we rank each product within its category.
Then we filter out only the winners per category, and remove the rank itself.

SELECT * EXCEPT(rank)
  FROM (SELECT row_number() OVER (PARTITION BY category ORDER BY price DESC) AS rank,
               *
          FROM products)
         WHERE rank = 1;
name                     category      price units consumable description                                  
electrostatic air filter HVAC hardware    10     1 false      Air filter for HVAC systems
strainer                 housewares       15     1 false      Bowl with a handle and fine holes
skimmer                  pool supplies   100     1 false      Drain filtering leaves from the water surface

Thank goodness for the EXCEPT syntax, which made this manageable. But we had to push a subquery, just like we had to do at the beginning of the post when we "invented" HAVING.

Again, this pattern is popular enough that SQL found a way. It's called the QUALIFY clause and it operates on expressions involving window functions only.

SELECT *
  FROM products
  QUALIFY row_number() OVER(PARTITION BY category ORDER BY price DESC) = 1;
name                     category      price units consumable description                                  
electrostatic air filter HVAC hardware    10     1 false      Air filter for HVAC systems
strainer                 housewares       15     1 false      Bowl with a handle and fine holes
skimmer                  pool supplies   100     1 false      Drain filtering leaves from the water surface

The trick with QUALIFY is not so much that it does allow the window function, but that it executes after the SELECT list, just like HAVING.

We can make another prediction: Sooner rather than later QUALIFY will be generalized to allow any predicate operating on any column in the SELECT list.

Wouldn't that be nice?

Conclusion

SQL provides a whole lot more than just WHERE to filter data:

HAVING filters on groups
FILTER refines aggregation
QUALIFY sieves through selected rows.

Databricks Community

Highly selective: SQL refined beyond the WHERE

Warm Up

HAVING

Aggregate level filtering

No LIMITs

Conclusion

Related

Metadata-Driven ETL Framework in Databricks (Part-1)

Top 10 query performance tuning tips for Databricks Serverless SQL

Best practices for safe data experimentation with Databricks