Apache Spark™ 4.0 introduces a new feature for SQL developers and data engineers: SQL Scripting. This feature enhances the power and extends the flexibility of Spark SQL, enabling users to write procedural code within SQL, with added support for the ANSI SQL/PSM scripting language in Spark 💥.
Yes, you can now use control flow and conditional statements, executing complex logic, loops, and branches directly in your SQL scripts. Whether your workloads involve data pipelines, transformations, or exploration of large datasets, Apache Spark™ 4.0 SQL Scripting extends the power of SQL to manipulate data with ease through procedural programming.

There are three primary reasons why this feature will help SQL developers who want to harness the power of Spark.
1. Familiarity and Portability
ANSI SQL/PSM is a standard and common way to write step-by-step logic in SQL, and many SQL developers already utilize it in databases such as PostgreSQL, Oracle, or SQL Server. By adding support for this in Spark, it becomes significantly easier to migrate existing SQL scripts into Spark without substantial changes. This means that SQL developers and data analysts can fully leverage Spark’s speed and power when working with data in the Lakehouse.
2. Better Control Flow in SQL Pipelines
Standard SQL in Spark is great for simple queries, but it struggles with variables, loops, conditions, and more complex logic. ANSI SQL/PSM scripting adds language constructs like DECLARE, SET, IF-THEN-ELSE, WHILE…DO, FOR…DO, REPEAT, CASE, and BEGIN…END, so you can write advanced logic directly in SQL without needing to switch to Scala, Python, or other tools (a short conditional sketch follows the three reasons below).
3. Easy Integration
Because ANSI SQL/PSM works seamlessly with Spark SQL, it enables complex logic without switching to another language like Python or Scala. Before, you had to use PySpark for procedural logic and Spark SQL for queries. With this feature, data analysts can do both in their familiar language, using SQL scripting.
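To give a quick taste of the conditional constructs before the full examples below, here is a minimal sketch of an IF-THEN-ELSE block; the table name jsd_orders is hypothetical and used only for illustration.
-- Minimal sketch: IF-THEN-ELSE in an ANSI SQL/PSM script
-- (jsd_orders is a hypothetical table used only for illustration)
BEGIN
  DECLARE order_count INT DEFAULT 0;
  SET order_count = (SELECT COUNT(*) FROM jsd_orders);
  IF order_count > 0 THEN
    SELECT 'table has rows' AS status;
  ELSE
    SELECT 'table is empty' AS status;
  END IF;
END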
Let’s compare a simple PySpark script with its equivalent ANSI SQL/PSM script. Assume we have a newly created target table t. First, the PySpark version:
# Create an empty DataFrame with the same schema as the target table t
df = spark.createDataFrame([], schema="c INT")

# Loop to insert values into the DataFrame
c = 10
while c > 0:
    df = df.union(spark.createDataFrame([(c,)], schema="c INT"))
    c -= 1

# Insert the DataFrame into the target table
df.write.insertInto("t")

# Display the contents of the table
spark.sql("SELECT * FROM t").show()
+--+
|c |
+--+
|10|
|9 |
|8 |
|7 |
|6 |
|5 |
|4 |
|3 |
|2 |
|1 |
+--+
And its equivalent ANSI SQL/PSM script:
BEGIN
  DECLARE c INT = 10;
  WHILE c > 0 DO
    INSERT INTO t VALUES (c);
    SET c = c - 1;
  END WHILE;
  SELECT * FROM t;
END
+--+
|c |
+--+
|10|
|9 |
|8 |
|7 |
|6 |
|5 |
|4 |
|3 |
|2 |
|1 |
+--+
Here is another example, using the FOR…DO loop construct in an ANSI SQL/PSM script.
-- Short ANSI SQL/PSM script with FOR loop calculations
BEGIN
  -- Declare variables
  DECLARE total_salary DECIMAL(12,2) DEFAULT 0;
  DECLARE total_ages INTEGER DEFAULT 0;

  -- Create and populate table
  CREATE TABLE jsd_employee (
    name VARCHAR(100),
    age INTEGER,
    salary DECIMAL(10,2),
    department VARCHAR(50)
  );
  INSERT INTO jsd_employee VALUES
    ('John Smith', 28, 65000.00, 'Engineering'),
    ('Sarah Johnson', 34, 75000.00, 'Marketing'),
    ('Mike Davis', 42, 85000.00, 'Finance'),
    ('Emily Brown', 29, 58000.00, 'HR');

  -- FOR loop to calculate totals
  FOR emp AS (SELECT age, salary FROM jsd_employee) DO
    SET total_salary = total_salary + emp.salary;
    SET total_ages = total_ages + emp.age;
  END FOR;

  -- Display results
  SELECT total_salary AS `Total Salary`, total_ages AS `Total Ages`;
END;
+------------+----------+
|Total Salary|Total Ages|
+------------+----------+
| 283000.0| 133|
+------------+----------+
And its PySpark 🐍 equivalent:
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as spark_sum

# Initialize Spark and create DataFrame
spark = SparkSession.builder.appName("EmployeeAnalysis").getOrCreate()
employee_df = spark.createDataFrame([
    ("John Smith", 28, 65000.00, "Engineering"),
    ("Sarah Johnson", 34, 75000.00, "Marketing"),
    ("Mike Davis", 42, 85000.00, "Finance"),
    ("Emily Brown", 29, 58000.00, "HR")
], ["name", "age", "salary", "department"])

# Calculate and display totals
employee_df.agg(
    spark_sum("salary").alias("Total Salary"),
    spark_sum("age").alias("Total Ages")
).show()

spark.stop()
+------------+----------+
|Total Salary|Total Ages|
+------------+----------+
| 283000.0| 133|
+------------+----------+