Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

UDFs in Databricks

Phani1
Valued Contributor

Hi Team,

Is there a particular reason why we should avoid using UDFs and instead convert the logic to DataFrame code?
Are there any restrictions or limitations (in terms of performance or governance) when using UDFs in Databricks?

 

Regards,

Janga

1 REPLY

Walter_C
Honored Contributor

Hello, some of the things you need to take into consideration are:

UDFs can introduce significant processing bottlenecks into code execution. Databricks automatically applies a number of optimizers to code written with built-in Apache Spark, SQL, and Delta Lake syntax. When custom logic is introduced through UDFs, these optimizers cannot efficiently plan tasks around it. In addition, logic that executes outside the JVM incurs extra data serialization costs (see the sketch below).
You can refer to https://docs.databricks.com/en/udf/index.html#which-udfs-are-most-efficient to understand which UDFs are most efficient for your workloads.
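As a minimal illustration (assuming a PySpark session such as the one available in a Databricks notebook; the data here is made up), the same transformation can be written as a Python UDF, which the optimizer treats as a black box, or with a built-in function it can reason about:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, upper, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python UDF: each row is serialized to a Python worker, and the
# optimizer cannot see inside the lambda.
to_upper_udf = udf(lambda s: s.upper() if s else None, StringType())
df.select(to_upper_udf(col("name")).alias("name_upper")).show()

# Built-in equivalent: runs inside the JVM and is fully visible to
# the optimizer, so no serialization round trip is needed.
df.select(upper(col("name")).alias("name_upper")).show()
```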

There are certain limitations when using UDFs in shared access mode on Unity Catalog. For instance, Hive UDFs are not supported, and applyInPandas and mapInPandas are not supported in Databricks Runtime 14.2 and below. In Databricks Runtime 14.2 and above, Scala scalar UDFs are supported, but other Scala UDFs and UDAFs are not. Python scalar UDFs and Pandas UDFs are supported in Databricks Runtime 13.3 LTS and above, but other Python UDFs, including UDAFs, UDTFs, and Pandas on Spark, are not supported.
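For example, a Python scalar Pandas UDF (one of the types noted above as supported in Databricks Runtime 13.3 LTS and above) might look like the following sketch; the function name and data are hypothetical:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, col
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

@pandas_udf(DoubleType())
def fahrenheit_to_celsius(f: pd.Series) -> pd.Series:
    # Vectorized over Arrow batches, so serialization overhead is
    # amortized compared with a row-at-a-time Python UDF.
    return (f - 32.0) * 5.0 / 9.0

df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
df.select(fahrenheit_to_celsius(col("temp_f")).alias("temp_c")).show()
```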

When should you use a UDF?

A major benefit of UDFs is that they allow users to express logic in familiar languages, reducing the human cost associated with refactoring code. For ad hoc queries, manual data cleansing, exploratory data analysis, and most operations on small or medium-sized datasets, latency overhead costs associated with UDFs are unlikely to outweigh costs associated with refactoring code.

For ETL jobs, streaming operations, operations on very large datasets, or other workloads that are executed regularly or continuously, refactoring logic to use native Apache Spark methods quickly pays dividends.
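As a hedged sketch of what such a refactor can look like (the column name and threshold here are made up for illustration), a row-level Python UDF can often be replaced with native expressions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, when, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
orders = spark.createDataFrame([(120.0,), (45.0,)], ["amount"])

# Before: business rule hidden inside a Python UDF.
tier_udf = udf(lambda amt: "high" if amt >= 100 else "low", StringType())
orders.select(tier_udf(col("amount")).alias("tier")).show()

# After: the same rule as native expressions that Spark's optimizer
# can plan around and execute without leaving the JVM.
orders.select(
    when(col("amount") >= 100, "high").otherwise("low").alias("tier")
).show()
```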
