06-28-2023 11:58 AM - edited 06-28-2023 12:03 PM
Below is a toy example of what I'm trying to achieve, but I don't understand why it fails. Can anyone explain why, and suggest a fix or a not-overly-bloated workaround?
%sql
create or replace function status_map(status int)
returns string
return map(
  10, "STATUS_10",
  20, "STATUS_20"
)[status];

with s1 as (
  select array(map('v1', 10), map('v1', 20)) as arr
)
-- WORKS: select transform(arr, e -> e.v1) from s1
-- WORKS: select transform(arr, e -> status_map(10)) from s1
-- FAILS: why does this not work?
select transform(arr, e -> status_map(e.v1)) from s1
The final select above fails with:
Error in SQL statement: AnalysisException: Resolved attribute(s) e#134732 missing from in operator !Project [cast(lambda e#134732[v1] as int) AS status#134734].; line 8 pos 27
The behavior is the same if I use named_struct() instead of map() for the array elements.
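For concreteness, the named_struct() variant looks like this (a sketch of what I mean; it fails with the same error):

with s1 as (
  select array(named_struct('v1', 10), named_struct('v1', 20)) as arr
)
select transform(arr, e -> status_map(e.v1)) from s1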
08-25-2023 04:40 AM
This doesn't answer the question, but apparently someone asked a similar question on Stack Overflow and also got zero replies: https://stackoverflow.com/questions/75010883/error-when-passing-spark-sql-higher-order-functions-lam...
I found this thread and that question because I've also been having the same problem.
Given that Python UDFs apparently aren't supported either (https://stackoverflow.com/questions/53761600/how-to-use-transform-higher-order-function/53762650#537...), and that SQL UDFs appear to be a newer feature than Python UDFs, SQL UDFs are probably just not supported for use with `transform`.
I hope I'm wrong, though, because it's very frustrating: SQL UDFs are basically essential for making code modular and legible, and code that uses transform and arrays is already difficult enough to maintain without having to inline what should be a SQL UDF. Arrays, structs, and other complex data types are supposed to be a major differentiator for Spark SQL versus other SQL implementations, and for Databricks versus other data warehouses, yet the basic features for reducing code complexity and increasing modularity and maintainability aren't supported when working with those data types.
08-25-2023 05:17 AM
The transform function in SQL is not the same as its Scala/PySpark counterpart (which chains transformations on a whole DataFrame). The SQL version is in fact a map() over the elements of an array.
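For example, a minimal sketch of the SQL version:

-- SQL transform() applies the lambda to each element of the array
select transform(array(1, 2, 3), x -> x * 2);
-- returns [2, 4, 6]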
I agree that functions are essential for code modularity. Hence I prefer to use Scala/PySpark instead of SQL.
08-28-2023 03:48 PM
I agree with you that Scala UDFs are helpful. But given that I personally know very little Scala, and that Python UDFs are much less performant than SQL and Scala UDFs, I prefer to use SQL UDFs when possible. It takes me 20 times longer (or worse) to implement the same logic in Scala as it does to write it in SQL. If I were proficient in Scala, I would probably use Scala UDFs more often, given that they are the best supported and most flexible.
Regarding the distinction between the Spark SQL higher-order function transform and the different transform in PySpark/Scala: I don't think there was any confusion in the question. There wasn't for me personally, at least, though I can't speak for the OP. I do wish they weren't named the same, if for no other reason than to make Google search results quicker to process.
As far as the Spark SQL transform goes (as opposed to the PySpark/Scala transform), it seems that only Scala UDFs are supported inside the lambda. Again, I find that surprising, because I had always assumed that under the hood SQL UDFs were inlined, almost like macros, directly into the query plan as Catalyst expressions. Based on this behavior, that seems to have been an oversimplification.
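One way to test that assumption (just a sketch; I haven't compared the plans myself) would be EXPLAIN with and without the UDF: if SQL UDFs really were inlined, the two plans should come out equivalent.

-- if status_map() were inlined as a Catalyst expression,
-- these two plans should be equivalent
explain select status_map(10);
explain select map(10, "STATUS_10", 20, "STATUS_20")[10];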
05-08-2025 06:51 AM
The lambda variable is scoped to the transform() function only, which is why this fails with a UDF: the lambda variable bindings (e in your case) are not available inside the UDF's scope. For the example above, we can use the workaround below, which generates the same execution plan and still helps to modularize the reusable function.
CREATE OR REPLACE FUNCTION status_map(arr ARRAY<MAP<STRING, INT>>)
RETURNS ARRAY<MAP<STRING, STRING>>
RETURN transform(arr, map_item ->
map(
map_entries(map_item)[0].key,
CASE map_entries(map_item)[0].value
WHEN 10 THEN 'STATUS_10'
WHEN 20 THEN 'STATUS_20'
ELSE NULL
END
)
);
WITH s1 AS (
SELECT ARRAY(MAP('v1', 10), MAP('v1', 20)) AS arr
)
SELECT status_map(arr) AS mapped_status FROM s1;
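-- expected result for the sample data (map display varies by client):
-- [{v1 -> STATUS_10}, {v1 -> STATUS_20}]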
5 hours ago
Scoped variables in a transform() are not accessible by UDFs. However, you can work around this using explode():
-- equivalent of: select transform(arr, e -> status_map(e.v1)) from s1
select collect_list(status_map(status_id))
from explode((select transform(arr, e -> e.v1) from s1)) as t(status_id)

To make the code easier to read, you can create a wrapper function:
create or replace function status_map_arr(status_arr array<int>)
returns array<string>
-- col is the default column name for explode() when an `as table(col)` expression is not used
return select collect_list(status_map(col)) from explode(status_arr);
with s1 as (
  select array(map('v1', 10), map('v1', 20)) as arr
)
select status_map_arr(transform(arr, e -> e.v1)) from s1
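-- expected result for the sample data: ["STATUS_10", "STATUS_20"]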