08-04-2016 10:49 AM
I created some ETL using DataFrames in Python. It used to run in ~180 sec, but it is now taking ~1200 sec. I have been changing it, so it could be something I introduced, or something in the environment.
Part of the process is appending results into a file on S3.
I am looking at the Spark Jobs page and I cannot see that any of them is active.
While I was writing this, I got: org.apache.spark.SparkException: Job aborted.
Command took 1274.63s -- by xxxxxxxx@gmail.com
at 8/4/2016, 12:44:17 PM on def4 (150 GB)
I have attached output that I got:
I assume that I should be able to see in the Spark UI what is active. I was surprised that Active Tasks on all executors was 0. Should I look at something else?
I tried to restart the cluster, but it was the same before and after. I used the same version of Spark 1.6.2 (Hadoop 2).
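Besides the Spark UI itself, the same job list is exposed as a JSON REST API, so you can poll for active jobs programmatically. A minimal sketch, assuming the standard Spark monitoring API (available since Spark 1.4); the driver host and application id below are placeholders, port 4040 is the default for the Spark UI, and on a managed platform like Databricks the UI may be proxied elsewhere:

```python
import json
from urllib.request import urlopen

def jobs_url(host, app_id, port=4040):
    """Build the Spark monitoring REST endpoint that lists running jobs."""
    return f"http://{host}:{port}/api/v1/applications/{app_id}/jobs?status=running"

def active_jobs(host, app_id, port=4040):
    """Fetch the currently running jobs as parsed JSON (needs a live driver)."""
    with urlopen(jobs_url(host, app_id, port)) as resp:
        return json.load(resp)

# Calling active_jobs() requires a reachable driver, so here we only show
# the URL it would hit; "driver-host" and the app id are placeholders.
print(jobs_url("driver-host", "app-20160804-0001"))
```

If this returns an empty list while a cell still shows "Running Command", the time is being spent on the driver (or in the notebook layer), not in Spark tasks.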
08-04-2016 11:43 AM
While I was waiting for a response (I had lunch in the meantime), I decided to do something else in this notebook, so I cloned it...
I have some initialization code in the notebook. It was taking 60 sec before cloning and 1.4 sec after. Wow!
Did you (Databricks support) do something on the cluster?
I am going to run my ETL command.
It was running very fast and then it got "stuck" again. I do not see any Spark job running.
08-04-2016 12:35 PM
In the meantime I got the idea to look into the driver log. I've found this:
2016-08-04T19:19:57.980+0000: [GC (Allocation Failure) [PSYoungGen: 6827008K->52511K(7299584K)] 7660819K->886330K(22848000K), 0.0142959 secs] [Times: user=0.08 sys=0.01, real=0.01 secs]
...
04T19:27:03.294+0000: [GC (Allocation Failure) [PSYoungGen: 7270001K->134234K(7454208K)] 8103861K->968093K(23002624K), 0.0509207 secs] [Times: user=0.33 sys=0.00, real=0.05 secs]
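For what it's worth, both of those lines are routine young-generation collections with pauses well under 0.1 s, so on their own they do not explain a 20-minute run. A small script can tally the pauses over a whole driver log to check whether GC time is actually significant. This is only a sketch: it understands just the minor-GC line format shown above (ParallelGC, as in these logs), and `sample` is built from the two lines quoted in this post:

```python
import re

# Matches minor-GC lines like:
# [PSYoungGen: 6827008K->52511K(7299584K)] 7660819K->886330K(22848000K), 0.0142959 secs]
GC_LINE = re.compile(
    r"\[PSYoungGen: \d+K->\d+K\(\d+K\)\] "        # young-gen transition
    r"(?P<before>\d+)K->(?P<after>\d+)K\(\d+K\)"  # whole-heap transition
    r", (?P<secs>[\d.]+) secs\]"
)

def summarize_gc(log_text):
    """Return (total_pause_secs, total_kib_reclaimed) across all minor-GC lines."""
    pause, freed = 0.0, 0
    for m in GC_LINE.finditer(log_text):
        pause += float(m.group("secs"))
        freed += int(m.group("before")) - int(m.group("after"))
    return pause, freed

sample = (
    "2016-08-04T19:19:57.980+0000: [GC (Allocation Failure) "
    "[PSYoungGen: 6827008K->52511K(7299584K)] "
    "7660819K->886330K(22848000K), 0.0142959 secs]\n"
    "2016-08-04T19:27:03.294+0000: [GC (Allocation Failure) "
    "[PSYoungGen: 7270001K->134234K(7454208K)] "
    "8103861K->968093K(23002624K), 0.0509207 secs]\n"
)
pause, freed = summarize_gc(sample)
print(f"total pause: {pause:.4f} s, heap freed: {freed} KiB")
```

If the totals stay small over the whole stall, the slowdown is somewhere other than garbage collection; long "Full GC" pauses (a different line format, not handled here) would be the worrying case.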
08-04-2016 01:02 PM
The process finally finished after 3600 sec (3x slower than the long duration that I was complaining about).
08-05-2016 01:16 PM
Today at some point I created a new cluster again.
Suddenly everything got much faster. It is back to 270 - 330 sec.
My question still stands: how do I know what the server is doing, and why it is slow or stuck?
Btw, how long does it take to moderate a question?
03-04-2019 10:36 PM
Was this issue resolved? I'm also getting the same problem on my spark cluster.
10-23-2019 09:38 AM
I have a similar issue. Several times per week I see a very slow (5+ minutes) "Running command" state on a cell that should take under 1 second to execute. Restarting the cluster usually solves the problem, but it is still a major inconvenience.
12-06-2019 09:13 AM
Check for GC (garbage collection) errors in standard out for the cluster.
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
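GC detail only shows up in stdout if it was switched on with JVM flags. A sketch of what that might look like in `spark-defaults.conf` terms, assuming the Java 7/8-era HotSpot flags that Spark 1.6 clusters ran with (on Databricks you would set these via the cluster's Spark config rather than a file):

```
spark.driver.extraJavaOptions   -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
```

The linked blog post also discusses changing the collector itself (e.g. adding `-XX:+UseG1GC`), which is a separate tuning decision from merely logging.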
01-14-2020 10:04 AM
I am getting this same issue. Occasionally a cell will display "Running Command" for as long as an hour. This can happen even for simple commands that ordinarily run in less than a second. I have tried restarting the cluster and attaching to a different cluster. Nothing seems to help.
04-19-2020 06:19 PM
Hi,
I am facing the same issue. Has anyone found a solution?
05-19-2020 09:21 AM
Mm, probably yes
04-28-2022 09:09 AM
I am having a very similar problem.
Since yesterday, without a known reason, some commands that used to run daily are now stuck in a "Running command" state. Commands like:
dataframe.show(n=1)
dataframe.toPandas()
dataframe.describe()
dataframe.write.format("csv").save(location)
are now stuck even for quite small DataFrames (for example, 28 rows and 5 columns). I would appreciate any help, since the problem also affects important daily jobs.