UDF not working inside transform() & lambda (SQL)
06-28-2023 11:58 AM - edited 06-28-2023 12:03 PM
Below is a toy example of what I'm trying to achieve, but I don't understand why it fails. Can anyone explain why, and suggest a fix or a workaround that isn't overly bloated?
%sql
create or replace function status_map(status int)
returns string
return map(
  10, "STATUS_10",
  20, "STATUS_20"
)[status];
with s1 as (
  select array(map('v1', 10), map('v1', 20)) as arr
)
-- WORKS: select transform(arr, e -> e.v1) from s1
-- WORKS: select transform(arr, e -> status_map(10)) from s1
-- FAILS: why does this not work?
select transform(arr, e -> status_map(e.v1)) from s1
The last query above fails with:
Error in SQL statement: AnalysisException: Resolved attribute(s) e#134732 missing from in operator !Project [cast(lambda e#134732[v1] as int) AS status#134734].; line 8 pos 27
The behavior is the same if I use named_struct() instead of map().
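The only non-bloated workaround I've found so far is a sketch that inlines the lookup map instead of calling the UDF, so that only built-in expressions touch the lambda variable (the pattern that works above):
%sql
with s1 as (
  select array(map('v1', 10), map('v1', 20)) as arr
)
-- Workaround sketch: inline the same map expression rather than calling status_map()
select transform(arr, e -> map(
  10, "STATUS_10",
  20, "STATUS_20"
)[e.v1]) from s1
But of course this defeats the purpose of defining the UDF in the first place.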
08-25-2023 04:40 AM
This doesn't answer the question, but apparently someone asked a similar question on Stack Overflow and also got zero replies: https://stackoverflow.com/questions/75010883/error-when-passing-spark-sql-higher-order-functions-lam...
I found this thread and that question because I've also been having the same problem.
Given that Python UDFs apparently aren't supported (https://stackoverflow.com/questions/53761600/how-to-use-transform-higher-order-function/53762650#537...), and that SQL UDFs are a newer feature than Python UDFs, SQL UDFs probably just aren't supported for use with `transform` either.
I hope I'm wrong, though, because it's very frustrating. SQL UDFs are basically essential for making code modular and legible, and code that uses transform and arrays is already difficult enough to maintain without having to inline the UDF bodies. Arrays, structs, and other complex data types are supposed to be a major differentiator for Spark SQL versus other SQL implementations, and for Databricks versus other data warehouses, yet the basic features for reducing code complexity and increasing modularity and maintainability aren't supported when working with those types.
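As a minimal reproduction of that claim (just a sketch, reusing the status_map UDF defined in the original post):
%sql
-- Works: the lambda variable resolves fine inside built-in expressions
select transform(array(10, 20), x -> x + 1);
-- Fails with the same AnalysisException: the lambda variable apparently
-- cannot be passed as an argument to a SQL UDF
select transform(array(10, 20), x -> status_map(x));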
08-25-2023 05:17 AM
The transform function in SQL is not the same as its Scala/PySpark counterpart: the SQL version is a higher-order function that applies a lambda to each element of an array. It is, in fact, a map().
Here is some interesting info
I agree that functions are essential for code modularity, which is why I prefer Scala/PySpark over SQL.
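To illustrate, a quick sketch of the element-wise behavior of the SQL transform:
%sql
-- SQL's transform applies the lambda to every element of the array and
-- returns a new array, i.e. a map() in the functional sense
select transform(array(1, 2, 3), x -> x * 10) as mapped;
-- => [10, 20, 30]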
08-28-2023 03:48 PM
I agree with you that Scala UDFs are helpful. But I personally know very little Scala, and Python UDFs are much less performant than SQL UDFs and Scala UDFs, so I prefer SQL UDFs when possible. It takes me 20 times longer, or worse, to implement the same logic in Scala as it would to write it in SQL. If I were proficient in Scala, I would probably use Scala UDFs more often, given that they are the best supported and most flexible.
Regarding the distinction between Spark SQL's higher-order transform function and the different transform in PySpark/Scala, I don't think there was any confusion in the question. There wasn't for me personally, at least, though I can't speak for the OP. I do wish they weren't named the same, if for no other reason than making Google search results quicker to process.
As far as using the Spark SQL transform goes (as opposed to the PySpark/Scala transform), it seems that only Scala UDFs are supported. Again, I find that surprising, because I had always assumed that under the hood SQL UDFs were inlined, or macro-expanded, almost directly into Catalyst expressions. Based on this behavior, that was evidently an oversimplification.
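For what it's worth, here is a workaround sketch that keeps the SQL UDF but moves the call outside any lambda, by exploding the array first (reusing the toy data and status_map UDF from the original post; note that collect_list does not guarantee element order, so treat this as a sketch only):
%sql
with s1 as (
  select array(map('v1', 10), map('v1', 20)) as arr
)
-- Explode the array, apply the UDF on ordinary rows, then re-aggregate
select collect_list(status_map(e.v1)) as mapped
from s1
lateral view explode(arr) as e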

