Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

UDFs in Databricks

Phani1
Valued Contributor

Hi Team,

Is there a particular reason why we should avoid using UDFs and instead convert the logic to DataFrame code?
Are there any restrictions or limitations (in terms of performance or governance) when using UDFs in Databricks?

 

Regards,

Janga

1 REPLY

Walter_C
Honored Contributor

Hello, some of the things you need to take into consideration are:

UDFs can introduce significant processing bottlenecks into code execution. Databricks automatically applies a number of optimizers to code written with built-in Apache Spark, SQL, and Delta Lake syntax. When custom logic is introduced through UDFs, these optimizers cannot efficiently plan tasks around it. In addition, logic that executes outside the JVM incurs extra data serialization costs (see the sketch below).
You can refer to https://docs.databricks.com/en/udf/index.html#which-udfs-are-most-efficient to understand which UDFs are most efficient for your workloads.
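As a minimal illustration (assuming a PySpark session such as the one available in a Databricks notebook; the data here is made up), the same transformation can be written as a Python UDF, which the optimizer treats as a black box, or with a built-in function it can reason about:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, upper, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python UDF: each row is serialized to a Python worker, and the
# optimizer cannot see inside the lambda.
to_upper_udf = udf(lambda s: s.upper() if s else None, StringType())
df.select(to_upper_udf(col("name")).alias("name_upper")).show()

# Built-in equivalent: runs inside the JVM and is fully visible to
# the optimizer, so no serialization round trip is needed.
df.select(upper(col("name")).alias("name_upper")).show()
```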

There are certain limitations when using UDFs in shared access mode on Unity Catalog. For instance, Hive UDFs are not supported, and applyInPandas and mapInPandas are not supported in Databricks Runtime 14.2 and below. In Databricks Runtime 14.2 and above, Scala scalar UDFs are supported, but other Scala UDFs and UDAFs are not. Python scalar UDFs and Pandas UDFs are supported in Databricks Runtime 13.3 LTS and above, but other Python UDFs, including UDAFs, UDTFs, and Pandas on Spark, are not supported.
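For example, a Python scalar Pandas UDF (one of the types noted above as supported in Databricks Runtime 13.3 LTS and above) might look like the following sketch; the function name and data are hypothetical:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, col
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

@pandas_udf(DoubleType())
def fahrenheit_to_celsius(f: pd.Series) -> pd.Series:
    # Vectorized over Arrow batches, so serialization overhead is
    # amortized compared with a row-at-a-time Python UDF.
    return (f - 32.0) * 5.0 / 9.0

df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
df.select(fahrenheit_to_celsius(col("temp_f")).alias("temp_c")).show()
```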

When should you use a UDF?

A major benefit of UDFs is that they allow users to express logic in familiar languages, reducing the human cost associated with refactoring code. For ad hoc queries, manual data cleansing, exploratory data analysis, and most operations on small or medium-sized datasets, latency overhead costs associated with UDFs are unlikely to outweigh costs associated with refactoring code.

For ETL jobs, streaming operations, operations on very large datasets, or other workloads that are executed regularly or continuously, refactoring logic to use native Apache Spark methods quickly pays dividends.
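As a hedged sketch of what such a refactor can look like (the column name and threshold here are made up for illustration), a row-level Python UDF can often be replaced with native expressions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, when, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
orders = spark.createDataFrame([(120.0,), (45.0,)], ["amount"])

# Before: business rule hidden inside a Python UDF.
tier_udf = udf(lambda amt: "high" if amt >= 100 else "low", StringType())
orders.select(tier_udf(col("amount")).alias("tier")).show()

# After: the same rule as native expressions that Spark's optimizer
# can plan around and execute without leaving the JVM.
orders.select(
    when(col("amount") >= 100, "high").otherwise("low").alias("tier")
).show()
```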
