Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How and when to capture the thread dump of the Spark driver?

brickster_2018
Databricks Employee

What is the best way to capture a thread dump of the Spark driver process? Also, when should I capture the thread dump?

1 ACCEPTED SOLUTION

brickster_2018
Databricks Employee

Steps to collect a thread dump (executor):

  1. Go to the cluster where the job is running and open the Spark UI.
  2. Navigate to the stuck task in the Spark UI by clicking the long-running job -> long-running stage -> tasks.
  3. On the tasks page, note down the "Task ID" and the "Host" where the task is stuck.
  4. Click the "Executors" tab in the Spark UI, then click "Thread dump" next to the host where the stuck task is running.
  5. This screen lists the active threads. Click the thread whose name contains the "Task ID" you noted in step 3.
  6. Reading through the thread's stack, you can identify which class and which function are being executed.
  7. To confirm that a task is stuck, take a thread dump every 2 minutes for 5-6 iterations. If all the collected thread dumps look the same, you can confirm that the task is stuck.
  8. The cause may be an external library you attached that is blocking the thread, or the job may be running in an infinite loop. Corrective action depends on the cause.
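Step 5 above is a manual lookup in the UI, but the same matching works on a saved thread dump file: Spark names executor task threads "Executor task launch worker for task <id>", which is what the helper below keys on. The sample dump is a trimmed, fabricated jstack-style snippet for illustration only:

```python
SAMPLE_DUMP = """\
"Executor task launch worker for task 42" #31 daemon prio=5
   java.lang.Thread.State: RUNNABLE
        at com.example.MyUdf.process(MyUdf.java:88)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)

"dispatcher-event-loop-0" #20 daemon prio=5
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
"""

def find_task_thread(dump_text, task_id):
    """Return the stack block of the thread running the given task ID, or None.

    Thread dumps separate per-thread blocks with a blank line; each block's
    first line is the quoted thread name.
    """
    for block in dump_text.strip().split("\n\n"):
        header = block.splitlines()[0]
        # Spark's executor threads are named after the task they run.
        if header.startswith('"') and f'for task {task_id}"' in header:
            return block
    return None
```

For example, `find_task_thread(SAMPLE_DUMP, 42)` returns the block showing that `com.example.MyUdf.process` is executing, while a task ID not present in the dump returns `None`.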


2 REPLIES 2


brickster_2018
Databricks Employee

For the Spark driver, the process is the same: choose the driver from the Executors page and view its thread dump.

A thread dump is a snapshot of the JVM's threads and their stacks; thread dumps are very useful in debugging issues where the JVM process is stuck or making extremely slow progress.

Thread dump collection is considered an advanced troubleshooting technique.
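The confirm-by-repetition idea (take several dumps and check that a thread's stack never changes) can be sketched for the driver's Python-side threads using the standard library. This is an illustrative sketch, not a Databricks API; `capture_thread_dump` and `looks_stuck` are hypothetical helper names:

```python
import sys
import threading
import time
import traceback

def capture_thread_dump():
    """Return a dict mapping thread name -> formatted stack trace."""
    id_to_name = {t.ident: t.name for t in threading.enumerate()}
    dump = {}
    for thread_id, frame in sys._current_frames().items():
        name = id_to_name.get(thread_id, str(thread_id))
        dump[name] = "".join(traceback.format_stack(frame))
    return dump

def looks_stuck(dumps, thread_name):
    """A thread is likely stuck if its stack is identical in every dump."""
    stacks = [d.get(thread_name) for d in dumps]
    return all(s is not None and s == stacks[0] for s in stacks)

# Take several dumps; when diagnosing a real job, space these ~2 minutes
# apart (time.sleep(120)) as the steps above suggest.
dumps = []
for _ in range(5):
    dumps.append(capture_thread_dump())
    time.sleep(0.1)
```

For the JVM side of the driver, the Spark UI's Executors page (or `jstack` against the driver PID, where you have shell access) plays the same role: collect several dumps and diff them before concluding the process is stuck.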
