Databricks Community

SergeRielau · ‎02-06-2024

What are SQL Session Variables?

Let’s pick each name component apart:

Variables
You can assign values to them once and then repeatedly use each value in other places, such as queries. You can also override the assigned value.
Session-based
The lifetime of the definition of the variable and the scope of its content is limited to a session. Different concurrent sessions can define and use variables by the same names without interfering with each other.
SQL
Unlike Hive variables or notebook widgets, variables are part and parcel of the SQL language. This means

You can use DDL statements to define and drop variables.
Variables can be explicitly typed and initialized using SQL expressions.
The name of each variable is an SQL identifier that may be qualified.
```
DECLARE tomorrow DATE DEFAULT current_date + INTERVAL '1' DAY;
```
You use a SQL Statement to set one or more variables.
```
SET VAR pie_day = DATE'2023-03-14';
```
You can reference a variable just like a column in most places expressions are allowed
There are scoping rules that dictate how name conflicts between variables and columns are handled (columns always win)
```
SELECT session.pie_day, pie_day FROM VALUES(pi()) AS t(pie_day);
2023-03-14  3.141592653589793
```

Using SQL Session Variables

Defining variables

Unlike temporary views and temporary functions, session variables are defined using a DECLARE statement. This is industry standard and compatible with variables in ISO SQL/PSM.

For clarity, you may also use DECLARE VARIABLE

DECLARE [OR REPLACE] [VARIABLE] var_name [ data_type ] [ { DEFAULT | = } expr ]

Variables reside in schemas, and session variables reside in SYSTEM.SESSION.
When declaring a variable you may qualify it as such.
The variable can be of any datatype supported in SQL, including complex MAP, ARRAY, or STRUCT composites.
You may even store small result sets in a map, as long as you are prudent about memory usage.

When a variable is first declared, its initial value is NULL unless you specify a default expression.
If you do, the expression is evaluated immediately, and the value is assigned to the variable.
For now, this default value may not contain any expression subqueries.

Note that when you define a default expression, you can derive the variable type from it.
This allows a dense definition:

DECLARE dept_id = uuid();

The most verbose equivalent definition for the above is:

DECLARE VARIABLE dept_id STRING DEFAULT uuid();

Referencing variables

Now that we have declared a variable, we can use it nearly everywhere an expression is allowed with a few sensible exceptions: Since the variables are only known within the session, they cannot be used in any persistent object definition, such as in a column default expression, or a CREATE VIEW statement.
You can, however, refer to them within other session-scoped objects, such as temporary views.

Let’s add a simple schema and use a variable to carry over a generated primary key to a foreign key.

CREATE TABLE dept(id STRING, name STRING);
CREATE TABLE emp(id STRING, name STRING, dept_id STRING);
INSERT INTO dept SELECT dept_id, 'Accounting';
INSERT INTO emp SELECT uuid(), 'Jones', dept_id;

As you can see, we were able to reuse the same generated ID in two places, and then use the variable just like a column reference.

Since variable names can overlap with column names, it is important to know how Catalyst resolves potential conflicts. Name resolution in SQL is done from the inside out. That is, local (innermost) names occlude names that are further up and out. Variables are in the outermost scope.
So they are occluded by any column or alias in scope, even by parameters of SQL UDF.
To force the resolution of a variable, you can qualify it by its schema: SESSION or SYSTEM.SESSION.

SELECT name FROM dept WHERE id = session.dept_id;
Accounting

Setting variables

A variable wouldn’t be variable if you couldn’t change its value. So Databericks support a SET VARIABLE statement. SET would have been nicer. Alas, it is impossible to cleanly separate the Hive variable and config namespaces from that of SQL Session Variables. Databricks makes up for it with a generous set of capabilities and a VAR shorthand.

SET { VAR | VARIABLE } { var_name = { expression | DEFAULT } } [, ...]

SET { VAR | VARIABLE } ( var_name [, ...] ) = ( query ) }

In short, you can:

Set one or more variables to an expression each by comma separating them.
Reset a variable to its default value. Note that this re-executes the default expression. It’s not necessarily resetting to the original value if that expression is not deterministic.
Assign the results of a query returning at most one row to a set of variables.

Let’s look at some examples:

Swap two variables
All sources will get executed before the first assignment. There is no need for a third variable.

DECLARE x = 6;
DECLARE y = 7;
SET VAR x = y,
        y = x;
VALUES (x, y);
 7  6

Refresh a variable using DEFAULT

SET VAR dept_id = DEFAULT;
INSERT INTO dept SELECT dept_id, 'HR';
INSERT INTO emp  SELECT uuid(), 'Smith', dept_id;

SET VAR dept_id = DEFAULT;
INSERT INTO dept SELECT dept_id, 'Sales';
INSERT INTO emp SELECT uuid(), 'Baker', dept_id;

Set multiple variables to the result of a query

DECLARE num_emps INT;
SET VAR (dept_id, num_emps)
        = (SELECT id, (SELECT COUNT(1) 
                         FROM emp
                          WHERE dept_id = id)
             FROM dept
             WHERE name = 'Accounting');
VALUES (dept_id, num_emps);
b520f54d-916c-4c63-96b1-ff3a69bb9281 1

If a query returns no rows, NULLs are assigned.

DECLARE num_emps INT;
SET VAR (dept_id, num_emps)
      = (SELECT id, (SELECT COUNT(1)
                       FROM emp
                       WHERE dept_id = id)
           FROM dept
           WHERE name = 'Engineering');
VALUES (dept_id, num_emps);
NULL NULL

Dropping variables

Generally speaking there is no need to drop session variables because they will disappear at the end of the session. But if you care, you can drop variables using the DROP TEMPORARY VARIABLE statement.

DROP TEMPORARY VARIABLE [IF EXISTS] var_name

Let’s do that:

DROP TEMPORARY VARIABLE session.dept_id;

DROP TEMPORARY VARIABLE x;

DROP TEMPORARY VARIABLE y;

DROP TEMPORARY VARIABLE num_emps;

Conclusion

SQL Session Variables are a valuable new addition to SQL, allowing you to store and reuse intermediate SQL results without needing a host language like Python. SQL Session variables are available starting in Databricks Runtime 14.1 and Apache Spark 3.5. For more information on SQL session variables see Variables in the documentation.

Enjoy, and happy querying!

jamessmith3 · ‎02-14-2024

Can you explain please how this relates to setting variables as var.xxxx and then calling using ${var.xxxx}?

SergeRielau · ‎02-14-2024

@Jame Sure.
SET (without VAR[IABLE]) refers to Hive variables and configs.
And ${var.xxxx} is used to reference hive variables/configs.

They difference between hive variables and session variables is two fold:

Session variables are explicitly declared, they do not appear just by setting them.
That has the benefit that a silly typo won't go unnoticed. And you also get to define the type.
The ${..} notation results in a macro substitution. That is the "text" of the hive variable is substituted into the SQL query string prior to parsing. On one hand this allows great liberty on composing SQL statements (I'll be writing about EXECUTE IMMEDIATELY very shortly). On the other hand it really invites SQL injection attacks.
A session variable cannot modify the SQL statement (outside of EXECUTE IMMEDIATE and IDENTIFIER()), it can only substitute in a value, similar to a parameter marker (`?` or `:parm`)

The TL;DR is that session variables should be better on all fronts.

I dream of a day where we can retire hive variable so we are not forced into the SET VAR syntax for variables....

AntonioR · ‎06-11-2024

Super useful facility and well explained, thanks.

Here's my question, though:

I have a need for a session variable (containing a Map) to be initialized in every session based on the results of a database query. It will be used in row filtering and column masking functions in Unity Catalog, so it's necessary that there's basically no way ever to sidestep this initialization for reasons of needing to protect sensitive data.

Where is the best place to put that initialization code to set session variables?

My first thought had been to do a kind of lazy call to initialize the variables in the row filtering/column masking functions but that's not feasible since SQL UDFs we're using for those can't have side effects such as setting session variables.

I'm in the dark on how to get this done @SergeRielau . Any suggestions appreciated.

SergeRielau · ‎06-11-2024

Hi @AntonioR ,
Thank you, and a very good question.
Session variables are, unfortunately, by their very definition unsafe since they cannot be secured.
Even if we were to support declaring and setting a session variable with complex default there is nothing preventing any user from reading out the value or overwriting it with another.
Do what you want we need to CREATE VARIABLE. That is we need to persist the variable in Unity Catalog and then secure it so that only the owner can update it and only an authorized user can access it.
A created variable still has a session private content, but with a well known definition which can then be used in a SQL UDF or a VIEW.
This is a well known requirement which we hope to address in the future.

Databricks Community

SQL Session Variables: Stash your state, and use it, too.

What are SQL Session Variables?

Using SQL Session Variables

Defining variables

Referencing variables

Setting variables

Dropping variables

Conclusion

Best practices for safe data experimentation with Databricks

Top 10 query performance tuning tips for Databricks Serverless SQL

UC Catalog Cloning: An Automated Approach