Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Very Slow UDF Execution on One Cluster Compared to Another with Similar Config

alexbarev
New Contributor II

Hi all,

I’m experiencing a significant slowdown in Python UDF execution times on a particular cluster. The same code runs much faster on another cluster with very similar hardware and policy settings.

The cell below takes 2–3 minutes on the problematic cluster, but only 10–30 seconds on the previous cluster we had in the workspace, which did not use Unity Catalog (UC).

# example from https://docs.databricks.com/aws/en/udf/unity-catalog

def squared(s):
    return s * s

spark.udf.register("squaredWithPython", squared)
spark.range(1, 20).createOrReplaceTempView("test")

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

squared_udf = udf(squared, LongType())
df = spark.table("test")
display(df.select("id", squared_udf("id").alias("id_squared")))

# This cell takes 2–3 minutes on the problematic cluster, but only 10–30 seconds on the other.
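
To separate the UDF overhead from general cluster slowness, here is a minimal comparison sketch; the time_action helper is hypothetical and just uses Python's time module to time a full collect() of the same query with and without the Python UDF:

import time
from pyspark.sql.functions import col

def time_action(df_to_run, label):
    # hypothetical helper: materialize the DataFrame and print the wall-clock time
    start = time.perf_counter()
    df_to_run.collect()
    print(f"{label}: {time.perf_counter() - start:.1f} s")

time_action(df.select("id", squared_udf("id").alias("id_squared")), "python udf")
time_action(df.select("id", (col("id") * col("id")).alias("id_squared")), "native expression")

If only the UDF branch is slow, the bottleneck is most likely in the Python worker path rather than in cluster startup or scheduling.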

 

The cell below, which uses an intentionally incorrect return schema, fails instantly the first time it is run, but from the second or third run onward it can take up to 10 minutes to fail.

 

from pyspark.sql.functions import udf
from pyspark.sql import types as T

def test_func():
    return 0     

# correct schema
# schema = T.IntegerType()

# schema with intentional error
schema = T.StructType()

test_udf = udf(test_func, schema)

df_test = spark.createDataFrame([("test",)], ["col1"])
display(
    df_test.withColumn("udf_result", test_udf())
)
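
To put a number on the slow failure, a small diagnostic sketch (nothing Databricks-specific assumed, just the standard time and traceback modules) wraps the action in a try/except and prints how long each run takes before the error surfaces:

import time
import traceback

start = time.perf_counter()
try:
    # force execution instead of display() so the time-to-failure is measured directly
    df_test.withColumn("udf_result", test_udf()).collect()
except Exception:
    traceback.print_exc(limit=1)
print(f"failed after {time.perf_counter() - start:.1f} s")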

 

 

Cluster config:

  • Policy: Unrestricted

  • Node type: rd-fleet.xlarge (32 GB, 4 Cores)

  • Workers: Min 1, Max 2 (current: 1)

  • Driver: rd-fleet.xlarge (32 GB, 4 Cores)

  • Access mode: Standard (Shared)

  • Runtime: 15.4 LTS (Spark 3.5.0, Scala 2.12)

  • Autoscaling: Enabled

  • Photon: Off

  • Auto-termination: 20 min

Notes:

  • All timings are observed when the cluster is already running and there are no other jobs or notebooks running in parallel.
  • Nothing I tried, such as renaming the UDF or using the udf decorator, made a difference: after the first quick run with the schema error, all subsequent runs of such a cell take an extremely long time before the error appears.

  • Detaching and re-attaching the notebook does not help. Restarting the cluster resolves the issue for a single cell run, but the problem returns after the cell is run again.

  • I don’t have access to cluster logs on the problematic cluster, but I can create new clusters for jobs and view their logs.

    • I tried creating a new cluster with a similar default configuration and observed the same issue with the simple code in the first block above: it took 2–3 minutes to run.

    • When running as a job, the cell with the wrong schema fails on a JVM exception, so I haven’t found a way to run it twice and check whether the long time-to-failure also occurs on subsequent runs.

Questions:

  • What can cause such slowdowns (10x or more) for simple UDFs or error feedback?

  • What log entries or events should I look for, given that I can get logs from job clusters where the issue is also observed?

  • Any tips for diagnosing this further?

2 REPLIES

alexbarev
New Contributor II

Our infra team told me we might be facing a strange Databricks bug. It happens only in our team’s workspace; other teams do not experience the bug with clusters that have an identical configuration.

Also, when I run jobs with the same settings as our cluster but with Policy: Job Compute - Single node and Access mode: Dedicated (formerly Single user), the issue disappears.

But when I use Access mode: Standard (formerly Shared), as on our cluster, the problem persists.

SP_6721
Contributor III

Hi @alexbarev ,

The slowdown is likely caused by running Python UDFs on a Standard (Shared) access mode cluster with Unity Catalog, where UDFs carry extra security and isolation overhead. A Dedicated access mode cluster avoids that isolation overhead, which typically resolves this kind of UDF performance issue.

To further improve performance:

  • Enable spark.sql.execution.pythonUDF.arrow.enabled = true in the cluster’s Spark config (see the sketch after this list).
  • Check the Spark UI for task delays or scheduler bottlenecks related to UDFs.
  • Review job logs for high serialization/deserialization times.
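
A minimal notebook-level sketch of the first suggestion, assuming DBR 15.4 / Spark 3.5, where both the session-level config and the useArrow flag on udf() are available (setting the config at the cluster level is equivalent but persists across restarts):

# enable Arrow serialization for Python UDFs for the current session
spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "true")

from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# or opt a single UDF into Arrow explicitly
squared_arrow_udf = udf(lambda s: s * s, LongType(), useArrow=True)
display(spark.range(1, 20).select("id", squared_arrow_udf("id").alias("id_squared")))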
