Variables You can assign values to them once and then repeatedly use each value in other places, such as queries. You can also override the assigned value.
Session-based The lifetime of the definition of the variable and the scope of its content is limited to a session. Different concurrent sessions can define and use variables by the same names without interfering with each other.
SQL Unlike Hive variables or notebook widgets, variables are part and parcel of the SQL language. This means
You can use DDL statements to define and drop variables.
Variables can be explicitly typed and initialized using SQL expressions.
The name of each variable is an SQL identifier that may be qualified.
DECLARE tomorrow DATE DEFAULT current_date + INTERVAL '1' DAY;
You use a SQL Statement to set one or more variables.
SET VAR pie_day = DATE'2023-03-14';
You can reference a variable just like a column in most places expressions are allowed
There are scoping rules that dictate how name conflicts between variables and columns are handled (columns always win)
SELECT session.pie_day, pie_day FROMVALUES(pi()) AS t(pie_day); 2023-03-14 3.141592653589793
Using SQL Session Variables
Unlike temporary views and temporary functions, session variables are defined using a DECLARE statement. This is industry standard and compatible with variables in ISO SQL/PSM.
Variables reside in schemas, and session variables reside in SYSTEM.SESSION. When declaring a variable you may qualify it as such. The variable can be of any datatype supported in SQL, including complex MAP, ARRAY, or STRUCT composites. You may even store small result sets in a map, as long as you are prudent about memory usage. When a variable is first declared, its initial value is NULL unless you specify a default expression. If you do, the expression is evaluated immediately, and the value is assigned to the variable. For now, this default value may not contain any expression subqueries.
Note that when you define a default expression, you can derive the variable type from it. This allows a dense definition:
DECLARE dept_id = uuid();
The most verbose equivalent definition for the above is:
DECLAREVARIABLE dept_id STRINGDEFAULT uuid();
Now that we have declared a variable, we can use it nearly everywhere an expression is allowed with a few sensible exceptions: Since the variables are only known within the session, they cannot be used in any persistent object definition, such as in a column default expression, or a CREATE VIEW statement. You can, however, refer to them within other session-scoped objects, such as temporary views.
Let’s add a simple schema and use a variable to carry over a generated primary key to a foreign key.
As you can see, we were able to reuse the same generated ID in two places, and then use the variable just like a column reference.
Since variable names can overlap with column names, it is important to know how Catalyst resolves potential conflicts. Name resolution in SQL is done from the inside out. That is, local (innermost) names occlude names that are further up and out. Variables are in the outermost scope. So they are occluded by any column or alias in scope, even by parameters of SQL UDF. To force the resolution of a variable, you can qualify it by its schema: SESSION or SYSTEM.SESSION.
SELECT name FROM dept WHERE id = session.dept_id; Accounting
A variable wouldn’t be variable if you couldn’t change its value. So Databericks support a SET VARIABLE statement. SET would have been nicer. Alas, it is impossible to cleanly separate the Hive variable and config namespaces from that of SQL Session Variables. Databricks makes up for it with a generous set of capabilities and a VAR shorthand.
Set one or more variables to an expression each by comma separating them.
Reset a variable to its default value. Note that this re-executes the default expression. It’s not necessarily resetting to the original value if that expression is not deterministic.
Assign the results of a query returning at most one row to a set of variables.
Let’s look at some examples:
Swap two variables All sources will get executed before the first assignment. There is no need for a third variable.
DECLARE x = 6; DECLARE y = 7; SET VAR x = y, y = x; VALUES (x, y); 7 6
Refresh a variable using DEFAULT
SET VAR dept_id = DEFAULT; INSERT INTO dept SELECT dept_id, 'HR'; INSERT INTO emp SELECT uuid(), 'Smith', dept_id;
SET VAR dept_id = DEFAULT; INSERTINTO dept SELECT dept_id, 'Sales'; INSERTINTO emp SELECT uuid(), 'Baker', dept_id;
Set multiple variables to the result of a query
DECLARE num_emps INT; SETVAR (dept_id, num_emps) = (SELECT id, (SELECTCOUNT(1) FROM emp WHERE dept_id = id) FROM dept WHERE name = 'Accounting'); VALUES (dept_id, num_emps); b520f54d-916c-4c63-96b1-ff3a69bb9281 1
If a query returns no rows, NULLs are assigned.
DECLARE num_emps INT; SET VAR (dept_id, num_emps) = (SELECT id, (SELECTCOUNT(1) FROM emp WHERE dept_id = id) FROM dept WHERE name = 'Engineering'); VALUES (dept_id, num_emps); NULL NULL
Generally speaking there is no need to drop session variables because they will disappear at the end of the session. But if you care, you can drop variables using theDROPTEMPORARYVARIABLE statement.
DROP TEMPORARY VARIABLE [IF EXISTS] var_name
Let’s do that:
DROP TEMPORARY VARIABLE session.dept_id;
DROP TEMPORARY VARIABLE x;
DROP TEMPORARY VARIABLE y;
DROP TEMPORARY VARIABLE num_emps;
SQL Session Variables are a valuable new addition to SQL, allowing you to store and reuse intermediate SQL results without needing a host language like Python. SQL Session variables are available starting in Databricks Runtime 14.1 and Apache Spark 3.5. For more information on SQL session variables see Variables in the documentation.