Apache Spark™ 4.0 introduces a new feature for SQL developers and data engineers: SQL Scripting. This feature enhances the power and extends the flexibility of Spark SQL, enabling users to write procedural code within SQL, with added support for the ANSI SQL/PSM scripting language in Spark 💥.
Yes, you can now use control flow and conditional statements, executing complex logic, loops, and branches directly in your SQL scripts. Whether your workloads involve data pipelines, transformations, or exploration of large datasets, Apache Spark™ 4.0 SQL Scripting extends the power of SQL to manipulate data with ease through procedural programming.

There are three primary reasons why this feature will help SQL developers who want to harness the power of Spark.
1. Familiarity and Portability
ANSI SQL/PSM is a standard and common way to write step-by-step logic in SQL, and many SQL developers already utilize it in databases such as PostgreSQL, Oracle, or SQL Server. By adding support for this in Spark, it becomes significantly easier to migrate existing SQL scripts into Spark without substantial changes. This means that SQL developers and data analysts can fully leverage Spark’s speed and power when working with data in the Lakehouse.
2. Better Control Flow in SQL Pipelines
Standard SQL in Spark is great for simple queries, but it struggles with variables, loops, conditions, and more complex logic. ANSI SQL/PSM scripting adds language constructs like DECLARE, SET, IF-THEN-ELSE, WHILE…DO, FOR…DO, REPEAT, CASE, and BEGIN…END, so you can write advanced logic directly in SQL without needing to switch to Scala, Python, or other tools (a short conditional sketch follows the three reasons below).
3. Easy Integration
Because ANSI SQL/PSM works seamlessly with Spark SQL, it enables complex logic without switching to another language like Python or Scala. Before, you had to use PySpark for procedural logic and Spark SQL for queries. With this feature, data analysts can do both in their familiar language, using SQL scripting.
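To give a quick taste of the conditional constructs before the full examples below, here is a minimal sketch of an IF-THEN-ELSE block; the table name jsd_orders is hypothetical and used only for illustration.
-- Minimal sketch: IF-THEN-ELSE in an ANSI SQL/PSM script
-- (jsd_orders is a hypothetical table used only for illustration)
BEGIN
  DECLARE order_count INT DEFAULT 0;
  SET order_count = (SELECT COUNT(*) FROM jsd_orders);
  IF order_count > 0 THEN
    SELECT 'table has rows' AS status;
  ELSE
    SELECT 'table is empty' AS status;
  END IF;
END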
Let’s compare a simple PySpark script with its equivalent ANSI SQL/PSM script. Assume we have a newly created target table t. First, the PySpark version:
# Create an empty DataFrame with the same schema as the target table t
df = spark.createDataFrame([], schema="c INT")

# Loop to insert values into the DataFrame
c = 10
while c > 0:
    df = df.union(spark.createDataFrame([(c,)], schema="c INT"))
    c -= 1

# Insert the DataFrame into the target table
df.write.insertInto("t")

# Display the contents of the table
spark.sql("SELECT * FROM t").show()
+--+
|c |
+--+
|10|
|9 |
|8 |
|7 |
|6 |
|5 |
|4 |
|3 |
|2 |
|1 |
+--+
And its equivalent ANSI SQL/PSM script:
BEGIN
  DECLARE c INT = 10;
  WHILE c > 0 DO
    INSERT INTO t VALUES (c);
    SET c = c - 1;
  END WHILE;
  SELECT * FROM t;
END
+--+
|c |
+--+
|10|
|9 |
|8 |
|7 |
|6 |
|5 |
|4 |
|3 |
|2 |
|1 |
+--+
Here is another example, using the FOR…DO loop construct in an ANSI SQL/PSM script.
-- Short ANSI SQL/PSM script with FOR loop calculations
BEGIN
  -- Declare variables
  DECLARE total_salary DECIMAL(12,2) DEFAULT 0;
  DECLARE total_ages INTEGER DEFAULT 0;

  -- Create and populate table
  CREATE TABLE jsd_employee (
    name VARCHAR(100),
    age INTEGER,
    salary DECIMAL(10,2),
    department VARCHAR(50)
  );
  INSERT INTO jsd_employee VALUES
    ('John Smith', 28, 65000.00, 'Engineering'),
    ('Sarah Johnson', 34, 75000.00, 'Marketing'),
    ('Mike Davis', 42, 85000.00, 'Finance'),
    ('Emily Brown', 29, 58000.00, 'HR');

  -- FOR loop to calculate totals
  FOR emp AS (SELECT age, salary FROM jsd_employee) DO
    SET total_salary = total_salary + emp.salary;
    SET total_ages = total_ages + emp.age;
  END FOR;

  -- Display results
  SELECT total_salary AS `Total Salary`, total_ages AS `Total Ages`;
END;
+------------+----------+
|Total Salary|Total Ages|
+------------+----------+
| 283000.0| 133|
+------------+----------+
And its PySpark 🐍 equivalent:
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as spark_sum

# Initialize Spark and create DataFrame
spark = SparkSession.builder.appName("EmployeeAnalysis").getOrCreate()
employee_df = spark.createDataFrame([
    ("John Smith", 28, 65000.00, "Engineering"),
    ("Sarah Johnson", 34, 75000.00, "Marketing"),
    ("Mike Davis", 42, 85000.00, "Finance"),
    ("Emily Brown", 29, 58000.00, "HR")
], ["name", "age", "salary", "department"])

# Calculate and display totals
employee_df.agg(
    spark_sum("salary").alias("Total Salary"),
    spark_sum("age").alias("Total Ages")
).show()

spark.stop()
+------------+----------+
|Total Salary|Total Ages|
+------------+----------+
| 283000.0| 133|
+------------+----------+