Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Cluster Performance

Vibhor
Contributor

Facing an issue with cluster performance: in the event log I can see "cluster is not responsive, likely due to GC". The number of pipelines (Databricks notebooks) running and the cluster configuration are the same as before, but we started seeing this issue about a week ago. What other possible root causes, besides memory or load on the cluster, could explain this, and how do we resolve it?
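One way to confirm that GC pauses really are the cause is to turn on JVM GC logging in the cluster's Spark config and look for long pauses in the driver and executor logs. A minimal sketch of the config, entered under Cluster > Advanced options > Spark config (one key-value pair per line), assuming the cluster runs on a Java 8 JVM (newer JVMs use -Xlog:gc instead of the Print* flags):

    spark.driver.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
    spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps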

1 ACCEPTED SOLUTION


jose_gonzalez
Databricks Employee

Hi @Vibhor Sethi​,

You can check the Spark UI --> SQL sub-tab. In the SQL tab you will be able to find the job duration, the physical plan, and the query execution DAG. Docs here

You need to check the DAG and the details of the query execution. It will give you details on the table scans, number of rows, data size, etc. Compare these DAG details between a Spark job that ran fine in the past and the new one that is stuck.
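If it is easier to capture the plan from a notebook than from the Spark UI, the same physical plan can also be printed programmatically and kept for comparison between a good run and a stuck run. A small sketch, assuming Spark 3.x ("my_table" and the filter are placeholders for the real query):

    # Print the formatted physical plan so it can be diffed against a healthy run.
    df = spark.table("my_table").filter("event_date = '2023-01-01'")
    df.explain(mode="formatted")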


5 REPLIES

Piper_Wilson
New Contributor III

Hello @Vibhor Sethi​ - My name is Piper and I'm one of the moderators for Databricks. Welcome and thank you for your question. Let's give it a bit longer to see what the community has to say. Otherwise, we'll circle back around.

jose_gonzalez
Databricks Employee

Hi @Vibhor Sethi​ ,

Do you see any other error messages? Did your data volume increase? What kind of job are you running?

Vibhor
Contributor

Hi @Jose Gonzalez​ - No, I don't see an error message; it just gets stuck sometimes while running the first cell, which takes only seconds when run manually. We have a distributed environment with multiple datasets and interactive clusters, so it is difficult to analyse. Is there any way, via the Spark UI etc., to analyse whether the data volume increased?
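One rough way to check data volume growth from a notebook is to total the file sizes under the dataset's storage path and compare across days. A small sketch, assuming the data is reachable through DBFS ("dbfs:/mnt/raw/events" is a placeholder; point it at your dataset):

    # Recursively sum file sizes under a path to spot input growth.
    def total_size_bytes(path):
        return sum(
            total_size_bytes(f.path) if f.isDir() else f.size
            for f in dbutils.fs.ls(path)
        )

    print(round(total_size_bytes("dbfs:/mnt/raw/events") / 1e9, 2), "GB")

The Spark UI's Stages tab also shows the input size read by each stage, which can be compared between an old run and a recent one.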


Hi @jose_gonzalez​, my problem is similar to this.
I have 5 interactive clusters with 24 jobs running on each cluster every 5 minutes. We cannot use job clusters here because their start-up time would make us miss the 5-minute SLA.

The jobs run fine, but sometimes, at random, some jobs across all the clusters get stuck for a while before continuing to finish, and we miss the 5-minute SLA for the next run. We have written logic so that the next single job run executes both the skipped window and the current 5-minute window. This happens to jobs scheduled on all the different clusters.

My only issue is: if all the jobs run fine, why does this suddenly happen on all the clusters at once, and then the next run is fine again? There is no information available in the stderr, stdout, or log4j logs, and it is very difficult to dig through the log4j logs because 24 jobs execute concurrently on each cluster.
Is there any way to check what is taking time behind the scenes while a notebook cell shows "Command is submitted for execution"?
For example, my main notebook has 2 cells, and each cell takes less than 1 minute to finish when things run fine, and the job duration shown on the right panel of the job run matches the cells' execution time. But sometimes I observe that both cells take the same time to execute, yet the job duration shows 5 minutes instead of the actual total execution time of about 1 minute.

And as for the DAG, it disappears from the Spark UI after some time, so how can we check the Spark UI for old runs?
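On the disappearing Spark UI: one option is to configure cluster log delivery so that driver, executor, and Spark event logs are copied to storage and can be inspected after the fact. A sketch using the Clusters API 2.0, where the workspace URL, token, and cluster details are all placeholders (clusters/edit requires re-sending the full cluster spec):

    # Enable cluster log delivery so logs outlive the cluster.
    # Every <...> value is a placeholder for your workspace.
    import requests

    resp = requests.post(
        "https://<workspace-url>/api/2.0/clusters/edit",
        headers={"Authorization": "Bearer <token>"},
        json={
            "cluster_id": "<cluster-id>",
            "spark_version": "<spark-version>",
            "node_type_id": "<node-type>",
            "num_workers": 2,
            "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs"}},
        },
    )
    resp.raise_for_status()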
