[UDF_MAX_COUNT_EXCEEDED] Exceeded query-wide UDF limit of 5 UDFs

Yaacoub
New Contributor

In my project I defined a UDF:

 

from pyspark.sql.functions import udf, col, lit
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def ends_with_one(value, bit_position):
    # Guard against a negative position that falls outside the string
    if bit_position + len(value) < 0:
        return 0
    else:
        return int(value[bit_position] == '1')

spark.udf.register("ends_with_one", ends_with_one)

 

But somehow, instead of registering the UDF once, it gets registered every time I call it:

 

df = df.withColumn('Ends_With_One', ends_with_one(col('Column_To_Check'), lit(-1)))

 

And after a few function calls I get the following error message:

 

[UDF_MAX_COUNT_EXCEEDED] Exceeded query-wide UDF limit of 5 UDFs (limited during public preview). Found 6. The UDFs were: `ends_with_one`,`ends_with_one`,`ends_with_one`,`ends_with_one`,`ends_with_one`,`ends_with_one`.

 

I spent a lot of time researching but I couldn't find my mistake.
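Each call I chain onto the DataFrame seems to count as a separate UDF reference in the final query plan, which would explain the six identical names in the error above. A minimal sketch of what I mean (the loop is illustrative, not my exact code):

# Illustrative only: chaining the UDF six times into one query plan
for i in range(6):
    df = df.withColumn(f'Check_{i}', ends_with_one(col('Column_To_Check'), lit(-1)))

df.explain()  # the single resulting plan references ends_with_one six times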

3 REPLIES

Kaniz
Community Manager

Hi @Yaacoub,

Your UDF appears to be registered every time you call it because Spark registers UDFs at the session level, not the notebook level. This means that if you register a UDF multiple times, you can exceed the maximum number of UDFs allowed per query, which is 5 during the public preview of Databricks SQL. To resolve this, define and register the UDF once, outside of the loop or function that calls it.

Here is an example of how you can modify your code to register the UDF only once:

from pyspark.sql.functions import udf, col, lit
from pyspark.sql.types import IntegerType

# Define and register the UDF once, at the top level
@udf(returnType=IntegerType())
def ends_with_one(value, bit_position):
    if bit_position + len(value) < 0:
        return 0
    else:
        return int(value[bit_position] == '1')

spark.udf.register("ends_with_one", ends_with_one)

# You can then call the UDF in a loop, a function, or elsewhere
df = df.withColumn('Ends_With_One', ends_with_one(col('Column_To_Check'), lit(-1)))

By defining the UDF outside of the loop or function that calls it, you register it only once instead of re-registering it on every call.
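Note that spark.udf.register is only needed if you also want to call the function from SQL; the @udf-decorated object already works directly with the DataFrame API. A quick sketch of the SQL path, assuming a hypothetical temp view named my_table (not from your post):

# 'my_table' is a hypothetical temp view used only for illustration
df.createOrReplaceTempView('my_table')
result = spark.sql(
    "SELECT Column_To_Check, ends_with_one(Column_To_Check, -1) AS Ends_With_One "
    "FROM my_table"
)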

Also, keep in mind that using many UDFs can negatively impact query performance, so it's generally good practice to use built-in Spark functions or DataFrame API operations to achieve the same results whenever possible.
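In this particular case, the UDF can be replaced entirely with built-in functions, which avoids the preview limit altogether. Here is a minimal sketch for the bit_position = -1 case from your example (a general bit_position would need a slightly different expression):

from pyspark.sql import functions as F

# Built-in equivalent of ends_with_one(col, -1): check whether the last
# character of the string is '1'; empty strings fall back to 0
df = df.withColumn(
    'Ends_With_One',
    F.when(F.length('Column_To_Check') >= 1,
           (F.substring('Column_To_Check', -1, 1) == '1').cast('int'))
     .otherwise(F.lit(0))
)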

jose_gonzalez
Moderator

Hi @Yaacoub,

Just a friendly follow-up: have you had a chance to review my colleague's reply? Please let us know if it helped resolve your issue.

Yaacoub
New Contributor

I used the proposed solution and defined the UDF outside of the loop, but I still got the same error. I ran the same code on Azure Synapse without any problem. I would appreciate your help in figuring out how to address the UDF problem.