BUG: Unity Catalog kills UDF

Erik_L
Contributor II

We have UDFs in a few locations, and today we noticed a severe performance regression. It appears to be caused by Unity Catalog.

Test environment 1:

  • Databricks Runtime Environment: 14.3 / 15.1
  • Compute: 1 driver, 4 worker nodes
  • Policy: Unrestricted
  • Access Mode: Shared

Test environment 2:

  • Databricks Runtime Environment: 14.3 / 15.1
  • Compute: Single Node
  • Policy: Unrestricted
  • Access Mode: Single user

Code:

import pandas as pd
import numpy as np
import pyspark.sql.functions as F
import pyspark.sql.types as T

# Create test dataframe:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
sdf = spark.createDataFrame(df)
sdf.writeTo('test.playground.abcd').createOrReplace()

# Now load the table back from Unity Catalog and apply the UDF:
def squared(x):
    return x * x

squared_udf = F.udf(squared, T.LongType())

sdf_2 = spark.read.table('test.playground.abcd')
sdf_2.withColumn('sq', squared_udf('A')).display()

Performance:

  • Test environment 1: 2 min 55s
  • Test environment 2: 8s
1 REPLY

Kaniz
Community Manager

Hi @Erik_L,

It appears that you’re experiencing performance issues related to Unity Catalog in your Databricks environment.

Let’s explore some potential reasons and solutions:

  1. Mismanagement of Metastores:

    • Unity Catalog, with one metastore per region, is crucial for structured data differentiation across regions.
    • Misconfiguring metastores can lead to operational issues.
    • Databricks’ Unity Catalog addresses challenges associated with traditional metastores like Hive and Glue.

      Recommendation:
      • Stick to one metastore per region and use Databricks-managed Delta Sharing for data sharing across regions.
      • This setup ensures regional data isolation at the catalog level, operational consistency, and strong data governance.
      • Understand the essentials of data governance and tailor a model for your organization.
      • Designate managed storage locations at the catalog and schema levels to enforce data isolation and governance.
      • Use Unity Catalog to bind catalogs to workspaces, allowing data access only in defined areas.
      • Configure privileges and roles in the data structure for precise access control.
  2. Inadequate Access Control and Permissions Configuration:

    • Unity Catalog’s efficient data management relies on accurate roles and access controls.
    • Ensure that you have set up appropriate roles and permissions for users and groups.
    • Regularly review and adjust access controls to maintain security and prevent performance bottlenecks.
  3. Bug in Source Data with Volume:

If you continue to experience performance problems, consider reaching out to Databricks support for further assistance. 😊🚀