Databricks

SergeRielau · ‎03-06-2024

Motivation

Note: You can find all examples to run here.

In past posts, we discussed parameter markers that you can use to templatize queries.

Given a simple example table:

CREATE OR REPLACE TABLE residents(
    first_name STRING,
    last_name  STRING,
    address    STRUCT<street STRING, city STRING, zip INT>);
INSERT INTO residents VALUES 
('Jason', 'Jones' , struct('100 Jewel St.' , 'Jasper' , 12345)),
('Jane' , 'Jones' , struct('12 Jello Ct.' , 'Jericho' , 54321)),
('Frank', 'Francis', struct('Wormser Str. 4', 'Kaiserslautern', 67657));

Instead of:

SELECT first_name FROM residents WHERE last_name = 'Jones';
Jason
Jane

You can write:

SELECT * FROM residents WHERE last_name = ?;

You can then “bind” the values to the query:

spark.sql("SELECT first_name, last_name, address.city"
          "  FROM residents WHERE last_name = ?",
          args = [ "Jones" ] ).show()
+----------+---------+-------+
|first_name|last_name| city  |
+----------+---------+-------+
| Jason    | Jones   | Jasper|
| Jane     | Jones   |Jericho|
+----------+---------+-------+

The benefit of this is two-fold:

Separating the fixed portion from the variable portion of the logic makes it more readable
This form of passing values to the query prevents SQL injection attacks

Of course, you can also use session variables to achieve the same effect:

DECLARE last_name = 'Jones';
SELECT first_name FROM residents WHERE last_name = session.last_name;
Jason
Jane

This works great for fixed queries. All you want to vary are the arguments to filters.

If you need to change the structure of the query, however, you need to resort to composing the SQL using Python or some other language. Or by using session variables in combination with the new EXECUTE IMMEDIATE:

SET VAR last_name = 'Jones';
DECLARE OR REPLACE first_name STRING;

DECLARE OR REPLACE stmt_head STRING
   DEFAULT 'SELECT first_name, last_name, address.city FROM residents ';
DECLARE OR REPLACE stmt STRING;
SET VAR stmt = stmt_head ||
       CASE WHEN last_name IS NOT NULL AND first_name IS NOT NULL 
            THEN 'WHERE last_name = session.last_name AND first_name = session.first_name'
            WHEN last_name IS NOT NULL
            THEN 'WHERE last_name = session.last_name'
            WHEN first_name IS NOT NULL
            THEN 'WHERE first_name = session.first_name'
       ELSE '' END;
SELECT stmt;
SELECT * FROM residents WHERE last_name = session.last_name

EXECUTE IMMEDIATE stmt;
Jason  Jones  Jasper
Jane   Jones  Jericho

While the above example is perhaps contrived, composing SQL like this is common and can be a security as well as a maintenance nightmare. If one of the fragments injected originates from the user it can be used to meaningfully change the query.

As soon as we introduced parameters and session variables, users demanded something in the middle. Users want to do more than binding in values and less than building the structure of the SQL statement:

Users want to parameterize table names, column names, and so on.

This is what the IDENTIFIER clause enables. IDENTIFIER allows the parameterization of names in a query without opening up the entire SQL to injection attacks.

How it works

The idea is simple:

When presented with a STRING the IDENTIFIER clause parses the STRING as an identifier and returns that identifier allowing the compilation of the query to continue.

For example:

DECLARE my_table = 'residents';
SELECT first_name FROM IDENTIFIER(my_table);
Jason
Jane
Frank

As you see above, the argument needs not to be a literal, such a restriction would make the feature rather pointless.
It can also be a parameter marker, or a session variable. In fact, the argument can be any constant expression returning a STRING. A constant expression is composed of:

Literals
Session variables
Certain functions such as current_schema() returning the same result throughout the duration of a query.
Any deterministic built-in functions.

Here is an example concatenating a table name - and also using explicit backticks:

SET VAR my_table = 'dents';
SELECT first_name FROM IDENTIFIER('`resi' || my_table || '`');
Jason
Jane
Frank

We can use an expression to build a qualified identifier, for example to parameterize the schema name:

DECLARE OR REPLACE my_schema = current_schema();
SELECT first_name FROM IDENTIFIER(my_schema || '.residents');
Jason
Jane
Frank

And you can use multiple session variables so both schema and table names are parameterized:

SET VAR my_table = 'residents';
SELECT first_name FROM IDENTIFIER(my_schema || '.' || my_table);
Jason
Jane
Frank

You can use concat to qualify or modify the string, and the string may include backticks for more esoteric identifiers:

SET VAR my_table = 'resi';
SELECT city FROM IDENTIFIER('`default`.' || my_table || 'dents');
Jasper
Jericho
Frankfurt

Where it works

We saw above that the IDENTIFIER clause can be used in the FROM clause to refer to a table.
Where else can it be used?

The clause can be used to interpret a constant string as a:

table or view name
function name
column name
field name
schema name

The clause is supported for the following statements:

Table, view, or function name of a CREATE, ALTER, DROP, UNDROP
Table name of a MERGE, UPDATE, DELETE, INSERT, COPY INTO
Target of a SHOW or DESCRIBE
USE of a schema
A function reference in an expression.
A column, filed, or variable reference in an expression.
This includes expressions embedded in a DDL or DML statement.

Examples

You can choose which columns to return or search for patterns:

DECLARE col_name = 'first_name';
SELECT IDENTIFIER(col_name) FROM residents WHERE IDENTIFIER(col_name) LIKE 'F%';
Frank

Or you can select fields to be returned from a struct:

DECLARE OR REPLACE field_name = 'street';
SELECT IDENTIFIER('address. ' || field_name) FROM residents; 
100 Jewel St.
12 Jello Ct.
Wormser Str. 4

You can pick which aggregate functions to use for some analytics query:

DECLARE agg_name = 'min';
SELECT IDENTIFIER(agg_name)(first_name) FROM residents;
Frank

You can even build parameterized scripts that create temporary tables of some name, insert into them, access to them, and finally drop them.

DECLARE tab_name = 'tmp_' || translate(uuid(), '-', '_');
CREATE TABLE IDENTIFIER(tab_name)(c1 INT);
INSERT INTO IDENTIFIER(tab_name) VALUES(1);
SELECT * FROM IDENTIFIER(tab_name);
1

DROP TABLE IDENTIFIER(tab_name);

What doesn’t work

It is important to note that IDENTIFIER does have some limits. For example, you cannot use it to produce correlation names:

SELECT * FROM VALUES(1) AS t(IDENTIFIER(‘c1’));

You can also not use it for column specifications in a CREATE or ALTER TABLE. For DDL statements only the “subject” of the statement can be parameterized using IDENTIFIER. Lastly, IDENTIFIER must be used to produce the entire optionally qualified name. You cannot prefix or postfix it with a dot (.). Instead use concat if needed to build the qualified name.

An obvious possible future extension would be to generate a “set” of identifiers. Something like:

SELECT IDENTIFIER(ARRAY(‘c1’, ‘c2’)) FROM VALUES(1, 2, 3) AS t(c1, c2, c3);
1  2

Conclusion

The IDENTIFIER clause fills a need for parameterizing names in SQL statements without having to resort to SQL building using string concatenation.

Databricks

IDENTIFIER: Turning ticks to backticks since DBR 13.2

Motivation

How it works

Where it works

Examples

What doesn’t work

Conclusion

Related

5 Common Pitfalls to Avoid when Migrating to Unity Catalog

UC Catalog Cloning: An Automated Approach

Top 10 query performance tuning tips for Databricks Serverless SQL