Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks Workflow is stuck on the first task and doesn't do any workload

cool_cool_cool
New Contributor II

Heya 🙂

I have a workflow in Databricks with 2 tasks. They are configured to run on the same job cluster, and the second task depends on the first.
I have run into a weird behavior that has happened twice now - the job takes a long time (it usually finishes within 30 minutes) but it has been running for more than 10 hours. The weird part is that the first task is in the "Running" state, but when I look at the Spark UI I don't see any jobs/stages/tasks/SQL queries - except for the fact that all of the executors are up and running.
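For reference, the job definition looks roughly like the sketch below (a simplified Jobs API 2.1-style payload; the task keys, notebook paths, and cluster spec are placeholders, not my real job):

```

# Simplified sketch of the job definition (Jobs API 2.1-style payload).
# Task keys, notebook paths, and cluster details are placeholders.
job_settings = {
    "name": "two-task-workflow",
    "schedule": {
        "quartz_cron_expression": "0 0 0/2 * * ?",  # every 2 hours
        "timezone_id": "UTC",
    },
    "job_clusters": [
        {
            "job_cluster_key": "shared_job_cluster",
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "task_1",
            "job_cluster_key": "shared_job_cluster",
            "notebook_task": {"notebook_path": "/Jobs/task_1"},
        },
        {
            "task_key": "task_2",
            "job_cluster_key": "shared_job_cluster",
            "notebook_task": {"notebook_path": "/Jobs/task_2"},
            # task_2 only starts after task_1 succeeds
            "depends_on": [{"task_key": "task_1"}],
        },
    ],
}

```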

In both cases I saw the following message in the error logs:

```

appcds_setup elapsed time: 0.000
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
Tue Oct 15 06:08:16 2024 Connection to spark from PID 1478
Tue Oct 15 06:08:16 2024 Initialized gateway on port 38197
Tue Oct 15 06:08:17 2024 Connected to spark.
Tue Oct 15 06:08:23 2024 Connection to spark from PID 1572
Tue Oct 15 06:08:23 2024 Initialized gateway on port 45679
Tue Oct 15 06:08:23 2024 Connected to spark.
ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1038, in send_command
response = connection.send_command(command)
File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 536, in send_command
answer = smart_decode(self.stream.readline()[:-1])
File "/usr/lib/python3.10/socket.py", line 705, in readinto
return self._sock.recv_into(b)
KeyboardInterrupt

```

 

This workflow is scheduled to run every 2 hours, and it usually works fine, but in the last 3 days or so it has happened twice and I haven't found anything about it.

Any ideas?

1 REPLY

VZLA
Databricks Employee

Given the provided context, the suggestion is to capture thread dumps from both the Spark driver and any active executor while the task appears to be hung. Ideally, you should also find messages in the Spark logs of the executor running the hung task coming from a HangTaskDetector class, which also captures thread dumps ready for analysis. These will give you some insight into why the task is hung (or progressing slowly). So you want to focus on the executor logs rather than the driver's.
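For example, one rough way to grab a driver thread dump while the job is hung (assuming you can reach the driver node, e.g. via the cluster's web terminal, and that the JDK's jps/jstack tools are on the PATH - the exact driver process name varies, so inspect the jps output first) is something like:

```

# Rough sketch: capture a thread dump of the driver JVM for offline analysis.
# Assumes jps/jstack (JDK tools) are available; pick the Spark driver PID from
# the `jps -l` output - the process name can differ between runtime versions.
import datetime
import subprocess

# List Java processes and print them so the driver JVM can be identified.
jps_output = subprocess.run(["jps", "-l"], capture_output=True, text=True).stdout
print(jps_output)

driver_pid = jps_output.split()[0]  # placeholder: replace with the driver's PID

# Dump all JVM thread stacks to a timestamped file.
stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
dump = subprocess.run(["jstack", driver_pid], capture_output=True, text=True).stdout
with open(f"/tmp/driver_threaddump_{stamp}.txt", "w") as f:
    f.write(dump)

```

Executor thread dumps can be taken the same way on the worker nodes, or from the Thread Dump links on the Executors tab of the Spark UI.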

The output log you shared is not very relevant; it only shows two instances of Spark driver initialization followed by a cancellation command.
