
How to apply a UDF to a property in an array of structs

RichardDriven
New Contributor III

I have a column that contains an array of structs as follows:

"column" : [ 
{ "struct_field1": "struct_value",  "struct_field2": "struct_value" }, 
{ "struct_field1": "struct_value",  "struct_field2": "struct_value" } 
]
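
For reference, a test DataFrame with this shape can be built like so (a minimal sketch; both fields are assumed to be strings):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Test data matching the shape above (field types assumed to be strings)
df = spark.createDataFrame(
    [([("a1", "b1"), ("a2", "b2")],)],
    "column: array<struct<struct_field1: string, struct_field2: string>>",
)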

I want to apply a udf to each field of the structs. I am currently trying to do this using transform; however, it does not seem to work because the udf is not receiving the context.

The error I get is "Cannot generate code for expression: <lambda>(lambda x_1#123.struct_field1)#45678"

from pyspark.sql.functions import transform, struct

df.select(transform("column", lambda x: struct(
    my_udf_for(x.struct_field1).alias("struct_field1"),
    my_udf_for(x.struct_field2).alias("struct_field2"),
)).alias("column"))

How do I nest a udf inside a transform?


Hi Kaniz,

Thank you for your response. However, it does not look like your code will compile: you reference the udf within SQL without registering it, and you appear to be mixing PySpark code into the SQL query where you use alias.

Even if I fix these issues with your code, it still does not execute and I get the same error:

SparkUnsupportedOperationException: [INTERNAL ERROR] Cannot generate code for expression: my_udf(lambda x#306.struct_field1)#307

I would appreciate it if you could advise whether this is expected behaviour or whether this functionality is supported.

Anonymous
Not applicable

@Richard Belihomji: Please try this.

To apply a UDF to a property in an array of structs using PySpark, you can define your UDF as a Python function and register it using the udf method from pyspark.sql.functions. Then, you can use the getItem method to extract the value of a particular field from the struct, and pass it as an argument to your UDF.

Here's an example code snippet that shows how to do this:

from pyspark.sql.functions import udf, transform, struct, col
 
# define your UDF
@udf
def my_udf(x):
    return x.upper()
 
# apply the UDF to the struct_field1 property in the array of structs
df = df.withColumn("column", 
                   transform(col("column"), 
                             lambda x: struct(
                                 my_udf(x.getItem("struct_field1")).alias("struct_field1"), 
                                 x.getItem("struct_field2").alias("struct_field2"))))

In this example, we define a UDF called my_udf that converts the input string to uppercase. We then use the withColumn method to apply the transform function to the column array. In the lambda function passed to transform, we use getItem to extract the value of struct_field1 and pass it as an argument to my_udf, aliasing the result as struct_field1. Similarly, we extract struct_field2 using getItem and rename it using alias.

Note that the function must be registered as a UDF, here via the @udf decorator. Used with no arguments, the decorator defaults the return type to StringType; for any other return type, pass an explicit returnType, or you may encounter errors or incorrect results.
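
For example, here is the same UDF with the return type spelled out explicitly (a minimal sketch, equivalent to the bare @udf above):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Same behaviour as the bare @udf decorator, with the return type made explicit
@udf(returnType=StringType())
def my_udf(x):
    return x.upper() if x is not None else None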

I hope this helps, and please let me know if you have any further questions or concerns.
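
If the transform-based version above still raises the same codegen error, that is because Spark does not support calling Python UDFs inside higher-order functions such as transform. A workaround that avoids this limitation is to apply a single UDF to the entire array and do the per-element work in plain Python. A sketch, assuming both fields are strings (the name upper_structs is illustrative):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

# Element schema assumed from the sample data in the question
element_schema = StructType([
    StructField("struct_field1", StringType()),
    StructField("struct_field2", StringType()),
])

# The UDF receives the whole array as a list of Rows and returns a new list;
# tuples map positionally onto the fields of the declared struct type.
@udf(returnType=ArrayType(element_schema))
def upper_structs(arr):
    if arr is None:
        return None
    return [
        (
            row["struct_field1"].upper() if row["struct_field1"] is not None else None,
            row["struct_field2"].upper() if row["struct_field2"] is not None else None,
        )
        for row in arr
    ]

df = df.withColumn("column", upper_structs("column"))

This trades some performance (the whole array is serialized to Python) for correctness, since it sidesteps the codegen limitation entirely.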
