Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks Workflow is stuck on the first task and doesn't do any workload

cool_cool_cool
New Contributor II

Heya 🙂

I have a workflow in Databricks with 2 tasks. They are configured to run on the same job cluster, and the second task depends on the first.
I have run into a weird behavior that has happened twice now - the job takes a long time (it usually finishes within 30 minutes) but it has been running for more than 10 hours. The weird part is that the first task is in the "Running" state, but when I look at the Spark UI I don't see any jobs/stages/tasks/SQL queries - except for the fact that all of the executors are up and running.
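For reference, the job definition looks roughly like the sketch below (a simplified Jobs API 2.1-style payload; the task keys, notebook paths, and cluster spec are placeholders, not my real job):

```

# Simplified sketch of the job definition (Jobs API 2.1-style payload).
# Task keys, notebook paths, and cluster details are placeholders.
job_settings = {
    "name": "two-task-workflow",
    "schedule": {
        "quartz_cron_expression": "0 0 0/2 * * ?",  # every 2 hours
        "timezone_id": "UTC",
    },
    "job_clusters": [
        {
            "job_cluster_key": "shared_job_cluster",
            "new_cluster": {
                "spark_version": "14.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    "tasks": [
        {
            "task_key": "task_1",
            "job_cluster_key": "shared_job_cluster",
            "notebook_task": {"notebook_path": "/Jobs/task_1"},
        },
        {
            "task_key": "task_2",
            "job_cluster_key": "shared_job_cluster",
            "notebook_task": {"notebook_path": "/Jobs/task_2"},
            # task_2 only starts after task_1 succeeds
            "depends_on": [{"task_key": "task_1"}],
        },
    ],
}

```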

In both cases I saw the following message in the error logs:

```

appcds_setup elapsed time: 0.000
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
ANTLR Tool version 4.8 used for code generation does not match the current runtime version 4.9.3
Tue Oct 15 06:08:16 2024 Connection to spark from PID 1478
Tue Oct 15 06:08:16 2024 Initialized gateway on port 38197
Tue Oct 15 06:08:17 2024 Connected to spark.
Tue Oct 15 06:08:23 2024 Connection to spark from PID 1572
Tue Oct 15 06:08:23 2024 Initialized gateway on port 45679
Tue Oct 15 06:08:23 2024 Connected to spark.
ERROR:root:KeyboardInterrupt while sending command.
Traceback (most recent call last):
File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1038, in send_command
response = connection.send_command(command)
File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/clientserver.py", line 536, in send_command
answer = smart_decode(self.stream.readline()[:-1])
File "/usr/lib/python3.10/socket.py", line 705, in readinto
return self._sock.recv_into(b)
KeyboardInterrupt

```

 

This workflow is scheduled to run every 2 hours, and it usually works fine, but in the last 3 days or so it has happened twice and I haven't found anything about it.

Any ideas?

1 REPLY

VZLA
Databricks Employee

Given the provided context, the suggestion is to capture thread dumps from both the Spark driver and any active executor while the task appears to be hung. Ideally, you should also find messages in the Spark logs of the executor running the hung task coming from a HangTaskDetector class, which also captures thread dumps ready for analysis. These will give you some insight into why the task is hung (or progressing slowly). So you want to focus on the executor logs rather than the driver's.
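For example, one rough way to grab a driver thread dump while the job is hung (assuming you can reach the driver node, e.g. via the cluster's web terminal, and that the JDK's jps/jstack tools are on the PATH - the exact driver process name varies, so inspect the jps output first) is something like:

```

# Rough sketch: capture a thread dump of the driver JVM for offline analysis.
# Assumes jps/jstack (JDK tools) are available; pick the Spark driver PID from
# the `jps -l` output - the process name can differ between runtime versions.
import datetime
import subprocess

# List Java processes and print them so the driver JVM can be identified.
jps_output = subprocess.run(["jps", "-l"], capture_output=True, text=True).stdout
print(jps_output)

driver_pid = jps_output.split()[0]  # placeholder: replace with the driver's PID

# Dump all JVM thread stacks to a timestamped file.
stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
dump = subprocess.run(["jstack", driver_pid], capture_output=True, text=True).stdout
with open(f"/tmp/driver_threaddump_{stamp}.txt", "w") as f:
    f.write(dump)

```

Executor thread dumps can be taken the same way on the worker nodes, or from the Thread Dump links on the Executors tab of the Spark UI.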

The output log you shared is not very relevant; it only shows two instances of Spark driver initialization followed by a cancellation command.
