cancel
Showing results for 
Search instead for 
Did you mean: 
Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
cancel
Showing results for 
Search instead for 
Did you mean: 
SergeRielau
Databricks Employee
Databricks Employee

sql_session.png

What are SQL Session Variables?

Let’s pick each name component apart:

  1. Variables
    You can assign values to them once and then repeatedly use each value in other places, such as queries. You can also override the assigned value.
  2. Session-based
    The lifetime of the definition of the variable and the scope of its content is limited to a session. Different concurrent sessions can define and use variables by the same names without interfering with each other.
  3. SQL
    Unlike Hive variables or notebook widgets, variables are part and parcel of the SQL language. This means
    1. You can use DDL statements to define and drop variables.
    2. Variables can be explicitly typed and initialized using SQL expressions.
    3. The name of each variable is an SQL identifier that may be qualified.
      DECLARE tomorrow DATE DEFAULT current_date + INTERVAL '1' DAY;
    4. You use a SQL Statement to set one or more variables.
      SET VAR pie_day = DATE'2023-03-14';
    5. You can reference a variable just like a column in most places expressions are allowed
    6. There are scoping rules that dictate how name conflicts between variables and columns are handled (columns always win)
      SELECT session.pie_day, pie_day FROM VALUES(pi()) AS t(pie_day);
      2023-03-14  3.141592653589793

Using SQL Session Variables

Defining variables

Unlike temporary views and temporary functions, session variables are defined using a DECLARE statement. This is industry standard and compatible with variables in ISO SQL/PSM.

For clarity, you may also use DECLARE VARIABLE

DECLARE [OR REPLACE] [VARIABLE] var_name [ data_type ] [ { DEFAULT | = } expr ]

Variables reside in schemas, and session variables reside in SYSTEM.SESSION.
When declaring a variable you may qualify it as such.
The variable can be of any datatype supported in SQL, including complex MAP, ARRAY, or STRUCT composites.
You may even store small result sets in a map, as long as you are prudent about memory usage.

When a variable is first declared, its initial value is NULL unless you specify a default expression.
If you do, the expression is evaluated immediately, and the value is assigned to the variable.
For now, this default value may not contain any expression subqueries.

Note that when you define a default expression, you can derive the variable type from it.
This allows a dense definition:  

DECLARE dept_id = uuid();

The most verbose equivalent definition for the above is:

DECLARE VARIABLE dept_id STRING DEFAULT uuid();

Referencing variables

Now that we have declared a variable, we can use it nearly everywhere an expression is allowed with a few sensible exceptions: Since the variables are only known within the session, they cannot be used in any persistent object definition, such as in a column default expression, or a  CREATE VIEW statement.
You can, however, refer to them within other session-scoped objects, such as temporary views. 

Let’s add a simple schema and use a variable to carry over a generated primary key to a foreign key.

CREATE TABLE dept(id STRING, name STRING);
CREATE TABLE emp(id STRING, name STRING, dept_id STRING);
INSERT INTO dept SELECT dept_id, 'Accounting';
INSERT INTO emp SELECT uuid(), 'Jones', dept_id;

As you can see, we were able to reuse the same generated ID in two places, and then use the variable just like a column reference.

Since variable names can overlap with column names, it is important to know how Catalyst resolves potential conflicts. Name resolution in SQL is done from the inside out. That is, local (innermost)  names occlude names that are further up and out. Variables are in the outermost scope.
So they are occluded by any column or alias in scope, even by parameters of SQL UDF.
To force the resolution of a variable, you can qualify it by its schema: SESSION or SYSTEM.SESSION.  

SELECT name FROM dept WHERE id = session.dept_id;
Accounting

Setting variables

A variable wouldn’t be variable if you couldn’t change its value. So Databericks support a SET VARIABLE statement. SET would have been nicer. Alas, it is impossible to cleanly separate the Hive variable and config namespaces from that of SQL Session Variables.  Databricks makes up for it with a generous set of capabilities and a VAR shorthand.

SET { VAR | VARIABLE } { var_name = { expression | DEFAULT } } [, ...]

SET { VAR | VARIABLE } ( var_name [, ...] ) = ( query ) }

In short, you can:

  • Set one or more variables to an expression each by comma separating them.
  • Reset a variable to its default value. Note that this re-executes the default expression. It’s not necessarily resetting to the original value if that expression is not deterministic.
  • Assign the results of a query returning at most one row to a set of variables.

Let’s look at some examples:

  • Swap two variables
    All sources will get executed before the first assignment. There is no need for a third variable.
DECLARE x = 6;
DECLARE y = 7;
SET VAR x = y,
        y = x;
VALUES (x, y);
 7  6
  • Refresh a variable using DEFAULT
SET VAR dept_id = DEFAULT;
INSERT INTO dept SELECT dept_id, 'HR';
INSERT INTO emp  SELECT uuid(), 'Smith', dept_id;

SET VAR dept_id = DEFAULT;
INSERT INTO dept SELECT dept_id, 'Sales';
INSERT INTO emp SELECT uuid(), 'Baker', dept_id;
  • Set multiple variables to the  result of a query
DECLARE num_emps INT;
SET VAR (dept_id, num_emps)
        = (SELECT id, (SELECT COUNT(1)

                         FROM emp
                         WHERE dept_id = id)
             FROM dept
             WHERE name = 'Accounting');
VALUES (dept_id, num_emps);
b520f54d-916c-4c63-96b1-ff3a69bb9281 1   
  • If a query returns no rows, NULLs are assigned.
DECLARE num_emps INT;
SET VAR (dept_id, num_emps)
      = (SELECT id, (SELECT COUNT(1)
                       FROM emp
                       WHERE dept_id = id)
           FROM dept
           WHERE name = 'Engineering');
VALUES (dept_id, num_emps);
NULL NULL

Dropping variables

Generally speaking there is no need to drop session variables because they will disappear at the end of the session. But if you care, you can drop variables using the DROP TEMPORARY VARIABLE statement.

DROP TEMPORARY VARIABLE [IF EXISTS] var_name

Let’s do that:

DROP TEMPORARY VARIABLE session.dept_id;

DROP TEMPORARY VARIABLE x;

DROP TEMPORARY VARIABLE y;

DROP TEMPORARY VARIABLE num_emps;

Conclusion

SQL Session Variables are a valuable new addition to SQL, allowing you to store and reuse intermediate SQL results without needing a host language like Python. SQL Session variables are available starting in Databricks Runtime 14.1 and Apache Spark 3.5. For more information on SQL session variables see Variables in the documentation.

Enjoy, and happy querying!

4 Comments