Hi all,
I'm experiencing a significant slowdown in Python UDF execution times on one particular cluster. The same code runs much faster on another cluster with very similar hardware and policy settings.
The cell below takes 2–3 minutes on the problematic cluster, but only 10–30 seconds on the previous cluster we had in the workspace with no UC.
# example from https://docs.databricks.com/aws/en/udf/unity-catalog
def squared(s):
    return s * s
spark.udf.register("squaredWithPython", squared)
spark.range(1, 20).createOrReplaceTempView("test")
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType
squared_udf = udf(squared, LongType())
df = spark.table("test")
display(df.select("id", squared_udf("id").alias("id_squared")))
# This cell takes 2–3 minutes on the problematic cluster, but only 10–30 seconds on the other.
The next cell has an intentionally incorrect schema. The first time it runs, it fails instantly with an error, but starting from the second or third run it can take up to 10 minutes to fail.
from pyspark.sql.functions import udf
from pyspark.sql import types as T
def test_func():
    return 0
# correct schema
# schema = T.IntegerType()
# schema with intentional error
schema = T.StructType()
test_udf = udf(test_func, schema)
df_test = spark.createDataFrame([("test",)], ["col1"])
display(
    df_test.withColumn("udf_result", test_udf())
)
Cluster config:
Policy: Unrestricted
Node type: rd-fleet.xlarge (32 GB, 4 Cores)
Workers: Min 1, Max 2 (current: 1)
Driver: rd-fleet.xlarge (32 GB, 4 Cores)
Access mode: Standard (Shared)
Runtime: 15.4 LTS (Spark 3.5.0, Scala 2.12)
Autoscaling: Enabled
Photon: Off
Auto-termination: 20 min
Notes:
- All timings are observed when the cluster is already running and there are no other jobs or notebooks running in parallel.
- No matter what I tried (such as renaming the UDF or using the udf decorator), after the first quick run with the schema error, every further run of that cell takes an extremely long time before the error is shown.
- Detaching and re-attaching the notebook does not help. I need to restart the cluster to get a single fast run of the cell, and the problem returns as soon as I run the cell again.
- I don't have access to cluster logs on the problematic cluster, but I can create new clusters for jobs and view their logs.
- I tried creating a new cluster with similar default configurations and observed the same issue with the simple code in the first code block above: it took 2–3 minutes to run.
- When running as a job, the run fails on a JVM exception, so I haven't found a way to make the cell with the wrong schema run twice to test whether the long computation time occurs on subsequent runs.
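One thing I am considering for the job case is catching the error and timing each attempt, so the cell can be retried within a single run. This is only a minimal sketch: `time_repeated_runs` is a hypothetical helper name, and it assumes the failure surfaces on the driver as a regular Python exception that can be caught, which I have not confirmed for the JVM exception seen in the job.

```python
import time


def time_repeated_runs(action, runs=3):
    """Call `action` several times, recording the duration and any error per run."""
    results = []
    for i in range(1, runs + 1):
        start = time.perf_counter()
        error = None
        try:
            action()
        except Exception as exc:
            # Swallow the error so that later runs can still be timed.
            error = f"{type(exc).__name__}: {exc}"
        elapsed = time.perf_counter() - start
        results.append({"run": i, "seconds": elapsed, "error": error})
        print(f"run {i}: {elapsed:.2f}s, error={error}")
    return results


# Hypothetical usage, reusing df_test and test_udf from the cell above:
# time_repeated_runs(
#     lambda: df_test.withColumn("udf_result", test_udf()).collect()
# )
```

If the first attempt is fast and later ones are slow, the per-run timings should make that visible in the job output without restarting the cluster between attempts.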
Questions:
What can cause such slowdowns (10x or more) for simple UDFs or error feedback?
Which logs or events should I look for in the job cluster logs, where the issue is also observed?
Any tips for diagnosing this further?