How to apply a UDF to a property in an array of structs

RichardDriven
New Contributor II

I have a column that contains an array of structs as follows:

"column" : [ 
{ "struct_field1": "struct_value",  "struct_field2": "struct_value" }, 
{ "struct_field1": "struct_value",  "struct_field2": "struct_value" } 
]

I want to apply a UDF to each field of the structs. I am currently trying to do this with transform, but it does not seem to work because the UDF is not receiving the context.

The error I get is "Cannot generate code for expression: <lambda>(lambda x_1#123.struct_field1)#45678"

from pyspark.sql.functions import struct, transform

df.select(transform("column", lambda x: struct(
  my_udf_for(x.struct_field1).alias("struct_field1"),
  my_udf_for(x.struct_field2).alias("struct_field2"),
)).alias("column"))

How do I nest a udf inside a transform?
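For reference, a toy DataFrame with this shape can be built as follows (a sketch, assuming both struct fields are strings):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data matching the shape above; both struct fields assumed to be strings.
df = spark.createDataFrame(
    [([("a", "b"), ("c", "d")],)],
    "column: array<struct<struct_field1: string, struct_field2: string>>",
)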

2 REPLIES

Hi Kaniz,

Thank you for your response. However, it does not look like your code will compile: you reference the UDF within SQL without registering it, and you seem to be mixing PySpark code into the SQL query, where you use alias.

Even if I fix these issues with your code, it still does not execute and I get the same error:

SparkUnsupportedOperationException: [INTERNAL ERROR] Cannot generate code for expression: my_udf(lambda x#306.struct_field1)#307

I would appreciate it if you could advise whether this is expected behaviour or whether the functionality is supported.

Anonymous
Not applicable

@Richard Belihomji: Please try this

To apply a UDF to a property in an array of structs using PySpark, you can define your UDF as a Python function and register it with the udf function (or @udf decorator) from pyspark.sql.functions. Then you can use the getItem method to extract the value of a particular field from each struct and pass it as an argument to your UDF.

Here's an example code snippet that shows how to do this:

from pyspark.sql.functions import udf, struct, col, transform

# define your UDF
@udf
def my_udf(x):
    return x.upper()

# apply the UDF to the struct_field1 property in each struct of the array
df = df.withColumn("column",
                   transform(col("column"),
                             lambda x: struct(
                                 my_udf(x.getItem("struct_field1")).alias("struct_field1"),
                                 x.getItem("struct_field2").alias("struct_field2"))))

In this example, we define a UDF called my_udf that converts the input string to uppercase. We then use withColumn to apply the transform function to the column array. In the lambda passed to transform, we use getItem to extract the value of the struct_field1 property and pass it to my_udf, aliasing the result as struct_field1. Similarly, we extract the struct_field2 property with getItem and rename it with alias.

Note that when the @udf decorator is used without arguments, the UDF's return type defaults to StringType; PySpark does not infer it from the function. If your UDF returns something other than a string, pass an explicit returnType, otherwise you may encounter errors or incorrect results.
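For example, a minimal sketch of declaring an explicit return type (the str_length UDF and its integer result are hypothetical, not part of this thread):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Hypothetical UDF whose result is an integer, so the default StringType would be wrong.
@udf(returnType=IntegerType())
def str_length(s):
    return len(s) if s is not None else None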

I hope this helps, and please let me know if you have any further questions or concerns.
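Follow-up note: if the transform-based approach still raises the codegen error reported earlier in the thread (Python UDFs generally cannot run inside the lambda of a higher-order function like transform), one workaround is to apply a single UDF to the whole array and do the per-field work in plain Python. A sketch, assuming both fields are strings; uppercase_structs is a hypothetical name:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Explicit schema of the transformed array (both fields assumed to be strings).
result_schema = ArrayType(StructType([
    StructField("struct_field1", StringType()),
    StructField("struct_field2", StringType()),
]))

@udf(returnType=result_schema)
def uppercase_structs(arr):
    # Each array element arrives as a Row; return tuples matching result_schema.
    if arr is None:
        return None
    return [(row["struct_field1"].upper(), row["struct_field2"].upper()) for row in arr]

df = df.withColumn("column", uppercase_structs("column"))

Because the UDF is applied at the top level rather than inside transform's lambda, it avoids the "Cannot generate code for expression" error, at the cost of serializing the whole array into Python.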
