10-28-2021 10:14 AM
Facing an issue with cluster performance; in the event log I can see "Cluster is not responsive, likely due to GC." The number of pipelines (Databricks notebooks) running and the cluster configuration are the same as before, but we started seeing this issue about a week ago. What other possible root causes, besides memory or load on the cluster, could cause this, and how do we resolve it?
10-28-2021 08:01 PM
Hello @Vibhor Sethi - My name is Piper and I'm one of the moderators for Databricks. Welcome and thank you for your question. Let's give it a bit longer to see what the community has to say. Otherwise, we'll circle back around.
11-08-2021 01:22 PM
Hi @Vibhor Sethi ,
Do you see any other error messages? Did your data volume increase? What kind of job are you running?
11-12-2021 09:25 AM
Hi @Jose Gonzalez - No, I don't see any error message; I just see that it sometimes gets stuck running the first cell, which takes only a second when run manually. We have a distributed environment with multiple datasets and interactive clusters, so it is difficult to analyse. Is there any way, via the Spark UI etc., to analyse whether the data volume has increased?
11-15-2021 10:42 AM
hi @Vibhor Sethi ,
You can check the Spark UI --> SQL sub-tab. In this SQL tab you will be able to find the job duration, physical plan, and query execution DAG. Docs here
You need to check the DAG and the details of the query execution. It will give you details on the table scan, number of rows, data size, etc. Compare these DAG details between a Spark job that ran fine in the past and the new one that is stuck.
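If you also want a programmatic way to track whether the data volume has increased, something like the rough sketch below can log table sizes and row counts over time (this assumes the sources are Delta tables; the table names are placeholders, and the row count can be expensive on large tables):

```python
# Rough sketch: log size and row count for the input tables so you can compare
# runs over time. Assumes Delta tables; table names are placeholders.
tables = ["mydb.table_a", "mydb.table_b"]  # hypothetical table names

for t in tables:
    # DESCRIBE DETAIL on a Delta table returns numFiles and sizeInBytes, among other columns
    detail = spark.sql(f"DESCRIBE DETAIL {t}").select("numFiles", "sizeInBytes").first()
    rows = spark.table(t).count()  # can be expensive on large tables
    print(f"{t}: files={detail['numFiles']}, bytes={detail['sizeInBytes']}, rows={rows}")
```

Logging this periodically gives you a baseline to compare against when a run suddenly slows down.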
08-24-2023 07:45 AM
Hi @jose_gonzalez, my problem is similar to this.
I have 5 interactive clusters, and 24 jobs run on each cluster every 5 minutes. I cannot use job clusters here because the start-up time would make us miss the 5-minute SLA.
The jobs usually run fine, but sometimes, at random, some jobs on all the clusters get stuck for a while and then continue to finish, so we miss the 5-minute SLA for the next run. We have written logic so that the next single job run processes both the skipped window and the current 5-minute window. It happens with jobs scheduled on different clusters.
My only question is: if all the jobs are otherwise running fine, why does this suddenly happen on all the clusters and then the next run is fine again? There is no information available in the stderr, stdout, or log4j logs, and it is very difficult to dig through the log4j logs because 24 jobs are executing concurrently on the cluster.
Is there any way to check what is taking time behind the scenes when a notebook cell shows "Command is submitted for execution"?
For example, I have 2 cells in the main notebook, and both cells together take less than 1 minute to finish when things run fine, and the job duration shown on the right panel of the job run matches the cells' execution time.
But sometimes I have observed that both cells took the same time to execute, yet the job duration is shown as 5 minutes instead of the roughly 1 minute of actual total cell execution time.
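A rough sketch of what I mean by timing the cells (placeholder logic, just to show the idea): print a wall-clock timestamp at the start of each cell and the time the cell body actually took, then compare the first timestamp with the run start time shown in the Jobs UI.

```python
# Rough sketch: per-cell wall-clock timing (placeholder logic).
import time
from datetime import datetime, timezone

cell_start = time.time()
print(f"cell started at {datetime.now(timezone.utc).isoformat()}")

# ... the real cell logic would go here ...

print(f"cell body took {time.time() - cell_start:.1f}s")
```

If the first cell's start timestamp is already several minutes after the run's scheduled start, the time is being lost before execution (e.g. commands queuing on a busy driver) rather than inside the cells themselves.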
And as for the DAG, it disappears from the Spark UI after some time, so how can we check the Spark UI for old runs?
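From what I understand, the Spark UI only keeps a limited number of jobs, stages, and SQL executions in memory, so older DAGs drop off. The retention limits are standard Spark configs that can be raised in the cluster's Spark config (the values below are only illustrative, and keeping more history uses more driver memory):

```python
# Illustrative retention settings (standard Spark configs; example values only).
# They must be set in the cluster's Spark config at startup, not at runtime:
#
#   spark.ui.retainedJobs            3000
#   spark.ui.retainedStages          3000
#   spark.sql.ui.retainedExecutions  3000
#
# To check what a running cluster actually picked up:
conf = spark.sparkContext.getConf()
for key in ["spark.ui.retainedJobs",
            "spark.ui.retainedStages",
            "spark.sql.ui.retainedExecutions"]:
    print(key, "=", conf.get(key, "<default>"))
```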