
UDF in Databricks

Phani1
Valued Contributor II

Hi Team,

Is there a particular reason why we should avoid using UDF and instead convert to DataFrame code?
Are there any restrictions or limitations (in terms of performance or governance) when using UDFs in Databricks?

 

Regards,

Janga

1 REPLY

Walter_C
Databricks Employee

Hello, some of the things you need to take into consideration are:

UDFs can introduce significant processing bottlenecks into code execution. Databricks automatically applies a number of different optimizers to code written with the built-in Apache Spark, SQL, and Delta Lake syntax. When custom logic is introduced through UDFs, these optimizers cannot efficiently plan tasks around it. In addition, logic that executes outside the JVM incurs extra costs for data serialization.
You can refer to: https://docs.databricks.com/en/udf/index.html#which-udfs-are-most-efficient to understand which UDFs can be more efficient and help you in your activities.
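
To make the serialization point concrete, here is a minimal sketch (column and DataFrame names are illustrative) contrasting a row-at-a-time Python UDF with the equivalent built-in function:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()  # already defined in Databricks notebooks
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python UDF: every row is serialized out of the JVM to a Python worker,
# and Catalyst treats the function as an opaque black box.
to_upper_udf = F.udf(lambda s: s.upper() if s is not None else None, StringType())
df.withColumn("name_upper", to_upper_udf("name")).show()

# Built-in equivalent: stays in the JVM and is fully visible to the optimizer.
df.withColumn("name_upper", F.upper(F.col("name"))).show()
```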

There are certain limitations when using UDFs in shared access mode on Unity Catalog:

- Hive UDFs are not supported.
- applyInPandas and mapInPandas are not supported in Databricks Runtime 14.2 and below.
- In Databricks Runtime 14.2 and above, Scala scalar UDFs are supported, but other Scala UDFs and UDAFs are not.
- Python scalar UDFs and Pandas UDFs (sketched below) are supported in Databricks Runtime 13.3 LTS and above, but other Python UDFs, including UDAFs, UDTFs, and Pandas on Spark, are not.
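
When a UDF is unavoidable, a scalar Pandas UDF is one of the supported flavors on those runtimes, and it amortizes serialization by operating on Arrow batches rather than single rows. A minimal sketch, with a made-up conversion function (assumes the `spark` session a Databricks notebook provides):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def fahrenheit_to_celsius(temp_f: pd.Series) -> pd.Series:
    # Runs once per Arrow batch instead of once per row.
    return (temp_f - 32) * 5.0 / 9.0

df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
df.withColumn("temp_c", fahrenheit_to_celsius("temp_f")).show()
```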

When should you use a UDF?

A major benefit of UDFs is that they allow users to express logic in familiar languages, reducing the human cost associated with refactoring code. For ad hoc queries, manual data cleansing, exploratory data analysis, and most operations on small or medium-sized datasets, latency overhead costs associated with UDFs are unlikely to outweigh costs associated with refactoring code.

For ETL jobs, streaming operations, operations on very large datasets, or other workloads that are executed regularly or continuously, refactoring logic to use native Apache Spark methods quickly pays dividends.
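
As an illustration of that kind of refactor (names are made up; `spark` is the notebook-provided session), conditional logic can often move out of a UDF and into native column expressions:

```python
from pyspark.sql import functions as F

df = spark.createDataFrame([(5,), (42,), (1000,)], ["amount"])

# Before: custom logic hidden inside a Python UDF (returns StringType by default).
tier_udf = F.udf(lambda a: "high" if a >= 100 else "low")
df.withColumn("tier", tier_udf("amount")).show()

# After: the same logic as native expressions the optimizer can plan around.
df.withColumn("tier", F.when(F.col("amount") >= 100, "high").otherwise("low")).show()
```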
