Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Cluster Performance

Vibhor
Contributor

Facing an issue with cluster performance: in the event log I can see "cluster is not responsive, likely due to GC". The number of pipelines (Databricks notebooks) running and the cluster configuration are the same as before, but we started seeing this issue about a week ago. What other possible root causes, besides memory or load on the cluster, could explain this, and how do we resolve it?
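One way to confirm that GC pauses really are the cause is to turn on JVM GC logging in the cluster's Spark config and look for long pauses in the driver and executor logs. A minimal sketch of the config, entered under Cluster > Advanced options > Spark config (one key-value pair per line), assuming the cluster runs on a Java 8 JVM (newer JVMs use -Xlog:gc instead of the Print* flags):

    spark.driver.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
    spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps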

1 ACCEPTED SOLUTION


jose_gonzalez
Databricks Employee

Hi @Vibhor Sethi​,

You can check the Spark UI --> SQL sub-tab. In the SQL tab you will be able to find the job duration, the physical plan, and the query execution DAG. Docs here

You need to check the DAG and the details of the query execution. It will give you details on the table scans, number of rows, data size, etc. Compare these DAG details between a Spark job that ran fine in the past and the new one that is stuck.
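If it is easier to capture the plan from a notebook than from the Spark UI, the same physical plan can also be printed programmatically and kept for comparison between a good run and a stuck run. A small sketch, assuming Spark 3.x ("my_table" and the filter are placeholders for the real query):

    # Print the formatted physical plan so it can be diffed against a healthy run.
    df = spark.table("my_table").filter("event_date = '2023-01-01'")
    df.explain(mode="formatted")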


5 REPLIES

Piper_Wilson
New Contributor III

Hello @Vibhor Sethi​ - My name is Piper and I'm one of the moderators for Databricks. Welcome and thank you for your question. Let's give it a bit longer to see what the community has to say. Otherwise, we'll circle back around.

jose_gonzalez
Databricks Employee

Hi @Vibhor Sethi​ ,

Do you see any other error messages? Did your data volume increase? What kind of job are you running?

Vibhor
Contributor

Hi @Jose Gonzalez​ - No, I don't see an error message; it just gets stuck sometimes while running the first cell, which takes only seconds when run manually. We have a distributed environment with multiple datasets and interactive clusters, so it is difficult to analyse. Is there any way, via the Spark UI etc., to analyse whether the data volume increased?
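One rough way to check data volume growth from a notebook is to total the file sizes under the dataset's storage path and compare across days. A small sketch, assuming the data is reachable through DBFS ("dbfs:/mnt/raw/events" is a placeholder; point it at your dataset):

    # Recursively sum file sizes under a path to spot input growth.
    def total_size_bytes(path):
        return sum(
            total_size_bytes(f.path) if f.isDir() else f.size
            for f in dbutils.fs.ls(path)
        )

    print(round(total_size_bytes("dbfs:/mnt/raw/events") / 1e9, 2), "GB")

The Spark UI's Stages tab also shows the input size read by each stage, which can be compared between an old run and a recent one.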


Hi @jose_gonzalez​, my problem is similar to this.
I have 5 interactive clusters with 24 jobs running on each cluster every 5 minutes. We cannot use job clusters here because their start-up time would make us miss the 5-minute SLA.

The jobs run fine, but sometimes, at random, some jobs across all the clusters get stuck for a while before continuing to finish, and we miss the 5-minute SLA for the next run. We have written logic so that the next single job run executes both the skipped window and the current 5-minute window. This happens to jobs scheduled on all the different clusters.

My only issue is: if all the jobs run fine, why does this suddenly happen on all the clusters at once, and then the next run is fine again? There is no information available in the stderr, stdout, or log4j logs, and it is very difficult to dig through the log4j logs because 24 jobs execute concurrently on each cluster.
Is there any way to check what is taking time behind the scenes while a notebook cell shows "Command is submitted for execution"?
For example, my main notebook has 2 cells, and each cell takes less than 1 minute to finish when things run fine, and the job duration shown on the right panel of the job run matches the cells' execution time. But sometimes I observe that both cells take the same time to execute, yet the job duration shows 5 minutes instead of the actual total execution time of about 1 minute.

And as for the DAG, it disappears from the Spark UI after some time, so how can we check the Spark UI for old runs?
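On the disappearing Spark UI: one option is to configure cluster log delivery so that driver, executor, and Spark event logs are copied to storage and can be inspected after the fact. A sketch using the Clusters API 2.0, where the workspace URL, token, and cluster details are all placeholders (clusters/edit requires re-sending the full cluster spec):

    # Enable cluster log delivery so logs outlive the cluster.
    # Every <...> value is a placeholder for your workspace.
    import requests

    resp = requests.post(
        "https://<workspace-url>/api/2.0/clusters/edit",
        headers={"Authorization": "Bearer <token>"},
        json={
            "cluster_id": "<cluster-id>",
            "spark_version": "<spark-version>",
            "node_type_id": "<node-type>",
            "num_workers": 2,
            "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs"}},
        },
    )
    resp.raise_for_status()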
