Let’s pick each name component apart:
DECLARE tomorrow DATE DEFAULT current_date + INTERVAL '1' DAY;
SET VAR pie_day = DATE'2023-03-14';
SELECT session.pie_day, pie_day FROM VALUES(pi()) AS t(pie_day);
2023-03-14 3.141592653589793
Unlike temporary views and temporary functions, session variables are defined using a DECLARE statement. This is industry standard and compatible with variables in ISO SQL/PSM.
For clarity, you may also use DECLARE VARIABLE
DECLARE [OR REPLACE] [VARIABLE] var_name [ data_type ] [ { DEFAULT | = } expr ]
Variables reside in schemas, and session variables reside in SYSTEM.SESSION.
When declaring a variable you may qualify it as such.
The variable can be of any datatype supported in SQL, including complex MAP, ARRAY, or STRUCT composites.
You may even store small result sets in a map, as long as you are prudent about memory usage.
When a variable is first declared, its initial value is NULL unless you specify a default expression.
If you do, the expression is evaluated immediately, and the value is assigned to the variable.
For now, this default value may not contain any expression subqueries.
Note that when you define a default expression, you can derive the variable type from it.
This allows a dense definition:
DECLARE dept_id = uuid();
The most verbose equivalent definition for the above is:
DECLARE VARIABLE dept_id STRING DEFAULT uuid();
Now that we have declared a variable, we can use it nearly everywhere an expression is allowed with a few sensible exceptions: Since the variables are only known within the session, they cannot be used in any persistent object definition, such as in a column default expression, or a CREATE VIEW statement.
You can, however, refer to them within other session-scoped objects, such as temporary views.
Let’s add a simple schema and use a variable to carry over a generated primary key to a foreign key.
CREATE TABLE dept(id STRING, name STRING);
CREATE TABLE emp(id STRING, name STRING, dept_id STRING);
INSERT INTO dept SELECT dept_id, 'Accounting';
INSERT INTO emp SELECT uuid(), 'Jones', dept_id;
As you can see, we were able to reuse the same generated ID in two places, and then use the variable just like a column reference.
Since variable names can overlap with column names, it is important to know how Catalyst resolves potential conflicts. Name resolution in SQL is done from the inside out. That is, local (innermost) names occlude names that are further up and out. Variables are in the outermost scope.
So they are occluded by any column or alias in scope, even by parameters of SQL UDF.
To force the resolution of a variable, you can qualify it by its schema: SESSION or SYSTEM.SESSION.
SELECT name FROM dept WHERE id = session.dept_id;
Accounting
A variable wouldn’t be variable if you couldn’t change its value. So Databericks support a SET VARIABLE statement. SET would have been nicer. Alas, it is impossible to cleanly separate the Hive variable and config namespaces from that of SQL Session Variables. Databricks makes up for it with a generous set of capabilities and a VAR shorthand.
SET { VAR | VARIABLE } { var_name = { expression | DEFAULT } } [, ...]
SET { VAR | VARIABLE } ( var_name [, ...] ) = ( query ) }
In short, you can:
Let’s look at some examples:
DECLARE x = 6;
DECLARE y = 7;
SET VAR x = y,
y = x;
VALUES (x, y);
7 6
SET VAR dept_id = DEFAULT;
INSERT INTO dept SELECT dept_id, 'HR';
INSERT INTO emp SELECT uuid(), 'Smith', dept_id;
SET VAR dept_id = DEFAULT;
INSERT INTO dept SELECT dept_id, 'Sales';
INSERT INTO emp SELECT uuid(), 'Baker', dept_id;
DECLARE num_emps INT;
SET VAR (dept_id, num_emps)
= (SELECT id, (SELECT COUNT(1)
FROM emp
WHERE dept_id = id)
FROM dept
WHERE name = 'Accounting');
VALUES (dept_id, num_emps);
b520f54d-916c-4c63-96b1-ff3a69bb9281 1
DECLARE num_emps INT;
SET VAR (dept_id, num_emps)
= (SELECT id, (SELECT COUNT(1)
FROM emp
WHERE dept_id = id)
FROM dept
WHERE name = 'Engineering');
VALUES (dept_id, num_emps);
NULL NULL
Generally speaking there is no need to drop session variables because they will disappear at the end of the session. But if you care, you can drop variables using the DROP TEMPORARY VARIABLE statement.
DROP TEMPORARY VARIABLE [IF EXISTS] var_name
Let’s do that:
DROP TEMPORARY VARIABLE session.dept_id;
DROP TEMPORARY VARIABLE x;
DROP TEMPORARY VARIABLE y;
DROP TEMPORARY VARIABLE num_emps;
SQL Session Variables are a valuable new addition to SQL, allowing you to store and reuse intermediate SQL results without needing a host language like Python. SQL Session variables are available starting in Databricks Runtime 14.1 and Apache Spark 3.5. For more information on SQL session variables see Variables in the documentation.
Enjoy, and happy querying!
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.