Administration & Architecture
Explore discussions on Databricks administration, deployment strategies, and architectural best practices. Connect with administrators and architects to optimize your Databricks environment for performance, scalability, and security.

Task Hanging issue on DBR 15.4

Dharma25
New Contributor II

Hello,

I am running a structured streaming pipeline with 5 models loaded using pyfunc.spark_udf. Lately we have been noticing a very strange issue: tasks hang and the batch takes a very long time to finish executing.

CPU utilization is around 90% and memory utilization is steady.

Issue:




Configs:
DBR 15.4
Job Compute
1 driver and 4 workers

Screenshot 2025-11-27 at 10.24.41 PM.png

2 REPLIES

bianca_unifeye
New Contributor III

On DBR 15.4 the DeadlockDetector: TASK_HANGING message usually just means Spark has noticed some very long-running tasks and is checking for deadlocks. With multiple pyfunc.spark_udf models in a streaming query the tasks often appear "stuck" because the Python UDF is blocking (heavy model inference, external calls, or GIL contention) while CPU stays high and memory steady.

I'd suggest:
– checking the Structured Streaming metrics to see if the batch is still progressing,
– taking executor thread dumps to confirm threads are blocked inside the UDF,
– testing the pipeline with fewer models / simplified UDFs to isolate which one causes the hang,
– making sure models are loaded once per executor and not doing network/I/O per row, and, if possible, moving to vectorised / Pandas UDFs.
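The "loaded once per executor" point can be sketched in plain Python: a common pattern is a lazy process-level singleton, so the model is deserialized once per Python worker rather than per batch or per row. This is an illustrative sketch, not a Databricks API; `load_model` stands in for something like an mlflow.pyfunc load call.

```python
# Illustrative sketch: cache an expensive model per worker process.
_MODEL = None

def get_model(load_model):
    """Return the process-local model, loading it only on first use."""
    global _MODEL
    if _MODEL is None:
        _MODEL = load_model()  # expensive: runs once per Python worker
    return _MODEL
```

Inside a UDF you would call `get_model(...)` on each invocation; only the first call per worker pays the loading cost.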

If the same code works on an older LTS runtime (try 14.3 or even an older one) but hangs on 15.4, it may be a runtime regression worth raising with Databricks Support, including the job and run IDs.

Thank you very much for your recommendations.

Additionally, I noticed that each executor typically runs 32 active tasks by default. However, when looking at the task execution summary under the DAG for various stages, it shows 300 tasks.
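The 32-vs-300 gap is expected: the task count per stage equals the partition count, while 32 is just the number of concurrent slots on one executor. Back-of-the-envelope, assuming the figures above (4 workers, 32 slots each, a 300-task stage):

```python
import math

# Rough scheduling arithmetic using the assumed figures from this thread.
slots_per_executor = 32
executors = 4
tasks = 300  # tasks in the stage = number of partitions

total_slots = slots_per_executor * executors  # 128 tasks can run at once
waves = math.ceil(tasks / total_slots)        # the stage runs in ~3 waves
```

So the stage is processed in a few scheduling waves rather than all at once, which is normal.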

Moreover, I found that executing `coalesce(1)` and then distributing it across all 10 models significantly improves performance, with batches running much faster.
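For anyone else reading, the coalesce-then-distribute idea can be sketched as round-robin routing of one partition's rows across the models. This is purely illustrative plain Python (the function name and callable models are my own stand-ins, not part of the pipeline):

```python
def route_round_robin(rows, models):
    """Fan one partition's rows out across the models in round-robin order."""
    return [models[i % len(models)](row) for i, row in enumerate(rows)]
```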