holly, Databricks Employee

This blog was posted to my feed, and it split readers into two groups: those who thought ‘meh’ and moved on with their lives, and the serious minority who thought this was the most super duper, better-than-sliced-bread, blow-your-socks-off feature.

holly_0-1723026947679.png

This piqued my curiosity: how could a feature I’d barely heard about be making such a difference? Maybe someone else’s project could be dramatically improved, if only they knew about this feature? So I thought it warranted a video. How hard could it be?

The Dunning-Kruger Effect explains the relationship between confidence and competence, and it’s fair to say that when I embarked on this endeavour I was at the peak of Mount Stupid*.

holly_1-1723026947647.png

*I am aware this graph is a complete misrepresentation of the phenomenon. The irony.

It turns out that Execute Immediate is not a feature in isolation; it builds on many other features that I’d either never heard of before, or that I hadn’t thought critically about.

In the Valley of Despair lie Identifier, Variables and Parameters, in a tangled, overlapping web that needs unpicking.

holly_2-1723026947616.png

If you read the docs as they are today, it's quite a circular set of features. We’re going to go through this together, and because I want to maximise clarity over brevity, we’re going to go through them in chronological order of Databricks Runtime (DBR) releases.

DBR 12.1: Parameters

These are used to pass values into SQL statements. You can only use parameters for values, so not for a column name or table name (yet!).

# creating x and y to substitute into a SQL query. This will work
spark.sql("SELECT :x * :y AS area", args = { "x" : 3, "y" : 4 })

# it resolves to...
SELECT 3 * 4 AS area

# this won't work, with or without quotation marks around 'area'
spark.sql("SELECT :x AS :y", args = { "x" : 3, "y" : "area" })

A note on ‘coding’ style: when reading blogs I always get confused about which parts of the code I’m meant to care about, so I’ve bolded the sections we should be paying attention to. It won’t display like this in Databricks.
Edit: apparently it doesn't display properly once this blog is rendered either, and instead makes it slightly fuzzier 🫠

My example is fairly tame, but the usefulness comes into its own when these parameters are defined programmatically, maybe through loops, if/then/else statements, or parameters passed in through jobs.
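As a minimal sketch of the programmatic case (the shapes list here is made up for illustration):

# hypothetical list of value pairs, perhaps built in a loop or from a job parameter
shapes = [ { "x" : 3, "y" : 4 }, { "x" : 5, "y" : 12 } ]

# the same parameterised statement is reused with different args on each iteration
for shape in shapes:
    spark.sql("SELECT :x * :y AS area", args = shape).show()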

Now originally this wasn’t in pure SQL, and was only available in Python, Scala and Java via spark.sql() syntax. That’s because to make it work in SQL you need Variables, and they don’t arrive until DBR 14.1.

DBR 13.1: Unnamed ? Parameters 

If you have a hard time naming parameters, you can skip this step and call them ? instead.

Reddit suggested childOfPartnerAndI[0]

Whilst this may save you some time, when using it in multiple places keep in mind that:

  1. The { } curly brackets become [ ] square brackets
  2. Order now matters
# creating x and y to substitute into a SQL query
spark.sql("SELECT ? * ? AS area", args = [ 3, 4 ])

# it resolves to...
SELECT 3 * 4 AS area

If the same value is used n times, you’ll have to repeat yourself n-1 times:

# named parameter repetition
spark.sql("SELECT :z * :z AS square", args = {"z" : 5})

# unnamed parameter repetition
spark.sql("SELECT ? * ? AS square", args = [ 5, 5 ])

I’m honestly not sure how this is more useful.

DBR 13.3: Identifiers 

Up until now we’ve only been able to use parameters with values inside our SQL statement, which is good, but lacks the serious power of being able to use them with column or table names. This is where IDENTIFIER() comes in.

With identifiers, you can now parameterise catalog, schema, table, view, function, column or struct field names. 

What’s great is that it’s SQL injection-proof. For those unfamiliar with SQL injection, it’s what unscrupulous people might use to drop data or cause havoc should they be able to enter free text. This is seen widely with front-end applications that have a UI, and occasionally back-end applications that don’t.

This joke wouldn’t have been possible if the school was using Databricks. Just sayin’
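To see that protection in action, here’s a minimal sketch (the table name is made up, and the exact error it raises may vary by runtime):

# a classic injection attempt, arriving as free text
malicious = "students; DROP TABLE students"

# naive string concatenation hands the attacker control of the SQL text;
# in engines that allow multiple statements, the DROP could actually run
# spark.sql("SELECT * FROM " + malicious)

# IDENTIFIER() instead validates the whole string as a single object name,
# so this raises an error rather than ever executing the DROP
spark.sql("SELECT * FROM identifier(:t)", args = { "t" : malicious })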

 

Here are some more Python identifier examples:

# using it for a schema and table name
spark.sql("create table identifier(:x||'.'||:y) as SELECT 12 AS area",
          args = { "x" : "examples_db", "y" : "area" })

# using schema and table names in one step with backticks
spark.sql("create table identifier(:z) as SELECT 12 AS area",
          args = { "z" : "`examples_db`.`area`" })

# using it with column or field names
spark.sql("""SELECT identifier(:column),
                    identifier('claim_struct.'||:field)
             FROM examples_db.claim_data""",
          args = { "column" : "claim_no", "field" : "type" })

# using it with a function
spark.sql("SELECT * FROM identifier(:function)()", args = {"function" : "country"})

At the time of writing, the docs only have Scala and SQL examples in them. I promise you it does work with Python, and I’m working with the docs team to get this included.

Unfortunately, you can’t use identifiers for everything all the time. There are a few rules to using them:

 

|                                          | Catalog | Schema | Table | View | Function | Column | Field |
| Create, alter, drop, undrop              |    ✓    |   ✓    |   ✓   |  ✓   |    ✓     |   ✗    |   ✗   |
| Merge, update, delete, insert, copy into |    -    |   -    |   ✓   |  ✗   |    ✗     |   ✗    |   ✗   |
| Show, describe                           |    ✓    |   ✓    |   ✓   |  ✓   |    ✓     |   -    |   -   |
| Use                                      |    ✓    |   ✓    |   -   |  -   |    -     |   -    |   -   |
| Function invocation                      |    -    |   -    |   -   |  -   |    ✓     |   -    |   -   |
| Referenced as part of a query            |    ✓    |   ✓    |   ✓   |  ✓   |    ✓     |   ✓    |   ✓   |
| Nested                                   |    ✗    |   ✗    |   ✗   |  ✗   |    ✗     |   ✗    |   ✗   |

Key: ✓ compatible, - not relevant, ✗ no functionality

In the docs this was wordier rather than a matrix. The table above is based on my own testing in DBR 15.3, and I’m working with the docs team to make this clearer.

With IDENTIFIER() you can’t define column names as part of create or update statements, e.g.

# defining a name within the SQL statement won't work
spark.sql("create or replace table example_db.area as SELECT 12 AS identifier(:x)"
, args = {"x" : "area"})

DBR 14.1: Variables and Declare

So far I’ve only shown Python examples, and that’s because to make SQL work we need an additional feature released in DBR 14.1: Variables, with the DECLARE syntax. This allows you to pass parameters or arguments into SQL statements without having to switch language.

# python syntax
spark.sql("SELECT :x * :y AS area", args = { "x" : 3, "y" : 4 })

-- sql syntax (verbose: specifies the type, and a default value for when it's unassigned)
DECLARE VARIABLE x INT DEFAULT 3;
-- sql syntax (minimalist)
DECLARE y = 4;

SELECT x * y AS area; 

For those of you doing migrations and staring at thousands of scripts, this is a blessing.

As for the difference between a variable and a parameter, this left me in a state of vexation for longer than I’d like to admit. After spending even more time thinking about it, here’s how I break it down:

 

|               | Parameter                                         | Variable |
| Language      | Python, Scala, Java (SQL with additional syntax*) | SQL      |
| Functionality | Low                                               | High: saved for the whole session; independent of query; can define type; can set defaults; can overwrite; can drop |
*more on this later

Let’s have a look at some of this functionality:

--just like tables, you can't make one with the same name if it already exists.
--Use OR REPLACE to overwrite values.

DECLARE OR REPLACE VARIABLE x INT DEFAULT 3;

--alternatively, you could declare it in a setup script, then use SET to update values later in the script.
--SET won’t update the type or default though.

DECLARE VARIABLE y INT DEFAULT NULL;
SET VARIABLE y = 5;
SET VARIABLE y = DEFAULT;

--once you're done with a variable, you can drop it.
--Why it needs TEMPORARY in there I have no idea. There's no minimalist version of this with VAR.

DROP TEMPORARY VARIABLE y;

--in combination with IDENTIFIER() for column, schema and table names 
DECLARE col_name = "area";
DECLARE table_id = '`example_db`.`table_name`'; --note the backticks and single quotes
SELECT identifier(col_name) from identifier(table_id);

-- resolving name collisions between columns and variables.
-- This one was a head scratcher for me, so try it out if you're struggling too.

DECLARE myvar INT; --SET VAR only works on a variable that already exists
SET VAR myvar = 2;
SELECT session.myvar, --returns 2 as it refers to the session variable set above
       T.myvar, --returns 1 as it refers to the column created in the FROM clause below
       myvar --returns 1 as the column takes precedence over the variable when both are in scope
FROM VALUES(1) AS T(myvar); --without the T(myvar) alias, a bare 'myvar' would resolve to the variable, not the column

All of these features allow for incredibly powerful scripting in a SQL-only scenario. This is great for those wanting to (lazily) maximise performance by using serverless, Photon and all the automated performance goodies.

DBR 14.3: Execute Immediate

There’s one more trick to make our SQL even more powerful. So far we’ve been able to put variables into our dynamic SQL code, but what if, instead, we want it the other way round: our variables defined by dynamic SQL code?

There are a few elements we’re chaining together here, so bear with me as we walk through.

EXECUTE IMMEDIATE will run a string that looks like a SQL statement

EXECUTE IMMEDIATE 'SELECT 3 * 4 AS area';

Now that’s not very impressive by itself, but we can start adding parameter markers, either with ? or names, or declare some variables.

--unnamed ? parameters, remember order matters here
EXECUTE IMMEDIATE 'SELECT ? * ? AS area'
USING 3,4;

--or we can name parameters, then order doesn't matter
EXECUTE IMMEDIATE 'SELECT :y * :x AS area'
USING (3 as x, 4 as y);

--or if we wanted to define them much earlier in our script, we could do that with variables
DECLARE x INT = 3;
DECLARE y INT = 4;
/* imagine much more code here*/
--the string needs parameter markers for USING to bind the variables to
EXECUTE IMMEDIATE 'SELECT ? * ? AS area'
USING (x, y);

--or if we wanted to go all the way with variables, we can even make the SQL string a variable too
DECLARE sql_string STRING = 'SELECT ? * ? AS area';
EXECUTE IMMEDIATE sql_string
--reuse x and y from earlier
USING (x, y);

And yes, this is similar to functionality we had above. But here’s the next step, where EXECUTE IMMEDIATE comes into its own: we can assign our results to variables using the INTO clause.

--these variables are used as part of the input
DECLARE x INT = 3;
DECLARE y INT = 4;

--these variables are empty placeholders for our output
DECLARE area INT;
DECLARE length INT;

--we can use INTO to assign the output to variables (named markers let us reuse x and y)
EXECUTE IMMEDIATE 'SELECT :x * :y AS area, greatest(:x, :y) AS length'
INTO area, length
USING (x AS x, y AS y);

SELECT length;
--returns 4

This is a fairly trivial example, but the same pattern can be used to chain parameterised SQL statements together.
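As a sketch of what that chaining might look like (the variable and alias names are made up for illustration):

--capture the output of one parameterised statement...
DECLARE max_side INT;
EXECUTE IMMEDIATE 'SELECT greatest(:x, :y) AS max_side'
INTO max_side
USING (3 AS x, 4 AS y);

--...then feed it into the next one
EXECUTE IMMEDIATE 'SELECT :side * :side AS square_area'
USING (max_side AS side);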

Anticipated FAQ

Will EXECUTE IMMEDIATE be ported to Python or Scala?
I doubt it. Given those languages have a lot more flexibility, I personally think it would be a moot feature.

Is this exclusively a Databricks feature set?
No, Spark 4.0 contains EXECUTE IMMEDIATE, Identifiers, Parameters and Variables. The open source version of IDENTIFIER() is SQL injection-proof too.

It’s not a new phenomenon either, with Oracle and PostgreSQL having had these features for a while, although I can’t attest to whether their versions of IDENTIFIER() are SQL injection-proof too.

TL;DR

Let’s summarise these features in something more succinct:

Parameters - used to insert values into spark.sql() statements. Most commonly found in Python, Scala & Java. Refer to them with :parameterName syntax.
Variables - a more advanced version of parameters, used exclusively in SQL. Declare them before use.
Identifiers - allow you to use parameters or variables for metadata and surrounding functionality, i.e. view names, column names or functions.
Execute Immediate - start chaining variables and SQL strings together to either execute code, or make even more variables.

Hopefully this gives you enough context to try it out with your next SQL project. It does take a while to sink in, but once it clicks it’s a very powerful tool.

Further Reading

Docs:
Execute Immediate
Declare Variable
Identifier
Variables
Parameters

Blogs:
Official Blog Post
Primary Key Execute Immediate Example 

Video:
Turn your strings into SQL with Execute Immediate
