We use UDFs in a few places, and today we noticed their performance collapsed. This appears to be caused by Unity Catalog.
Test environment 1:
- Databricks Runtime Environment: 14.3 / 15.1
- Compute: 1 driver, 4 worker nodes
- Policy: Unrestricted
- Access Mode: Shared
Test environment 2:
- Databricks Runtime Environment: 14.3 / 15.1
- Compute: Single Node
- Policy: Unrestricted
- Access Mode: Single user
Code:
import pandas as pd
import numpy as np
import pyspark.sql.functions as F
import pyspark.sql.types as T
# Create test dataframe:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
sdf = spark.createDataFrame(df)
sdf.writeTo('test.playground.abcd').createOrReplace()
# Now load from unity catalog and apply UDF:
def squared(x):
    return x * x
squared_udf = F.udf(squared, T.LongType())
sdf_2 = spark.read.table('test.playground.abcd')
sdf_2.withColumn('sq', squared_udf('A')).display()
Performance:
- Test environment 1: 2 min 55s
- Test environment 2: 8s
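One commonly cited source of UDF overhead on Shared access mode clusters is that each row-at-a-time Python UDF call crosses the JVM/Python boundary individually. A possible mitigation, not verified on this cluster, is a vectorized (pandas_udf-style) function that receives a whole batch per call. A minimal runnable sketch of the same squaring logic in batch form (the `squared_batch` name is hypothetical; in Spark it would be wrapped with `@F.pandas_udf(T.LongType())`):

```python
import pandas as pd
import numpy as np

# Row-at-a-time UDF (as in the snippet above): one Python call per row.
def squared(x):
    return x * x

# Vectorized, pandas_udf-style batch function: one Python call per
# Arrow batch instead of one per row. Shown standalone here so the
# logic runs without a Spark session.
def squared_batch(x: pd.Series) -> pd.Series:
    return x * x

batch = pd.Series(np.arange(5))
print(squared_batch(batch).tolist())  # [0, 1, 4, 9, 16]
```

Where the transformation is this simple, a native column expression such as `sdf_2.withColumn('sq', F.col('A') * F.col('A'))` avoids the Python round-trip entirely and may be the fastest option on either access mode.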