01-22-2018 06:55 AM
No other output is available, not even output from cells that did run successfully.
Also, I'm unable to connect to the Spark UI or view the logs. It attempts to load each of them, but after some time an error message appears saying it's unable to load.
This happens on a job that runs every Sunday. I've tried varying the cluster configuration (spot vs. on-demand, number of instances, instance types, etc.) and nothing seems to fix it. The job runs two notebooks via the dbutils.notebook.run utility, and I am able to run each notebook as its own job without problems; it only fails when they are run together.
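For context, the orchestration is roughly the following (the notebook paths and the timeout are placeholders, not the real values):

# Parent notebook: runs the two child notebooks one after the other.
# The paths and the timeout (in seconds) are illustrative placeholders.
result_a = dbutils.notebook.run("/jobs/notebook_a", 7200)
result_b = dbutils.notebook.run("/jobs/notebook_b", 7200)
print(result_a, result_b)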
Any suggestions for figuring out what's going on? At this point, I'm thinking of breaking this up into two jobs and trying to stagger them far enough apart that the first is sure to finish before the second starts.
01-24-2018 06:25 AM
I'm not sure if this is related, but when I ran this job yesterday, the spot where it fails on the weekly run never completed this time. The job typically takes about 2 hours, but I noticed it was still running at 14 hours. I canceled the job before looking at the Spark UI/logs, and now that the job has finished in a failed state, I am unable to load the Spark UI or view the logs, with the same error message as above.
01-24-2018 08:16 AM
It seems like something about how this job fails prevents Databricks from restoring the logs or the Spark UI. I remember something like this happening in the past, and it was related to a job outputting UTF-8 characters. I believe Databricks fixed that issue, and this job shouldn't trigger it anyway: it's a counting job, and all text is pre-sanitized to contain only ASCII or numeric IDs.
01-25-2018 08:30 AM
I don't have proof of this, but I suspect this is simply a memory issue where Spark either hangs (presumably stuck in GC) or is killed by the OS. Watching the logs as the job progressed, I noticed that GC cycles were happening more frequently as it approached the point where the job typically dies or hangs. I re-ran the job on a larger instance size and it went right past where it had died/hung in the past. Of course, the real test will be running this notebook as part of the larger job it typically runs with.
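If driver memory pressure really is the cause, another lever besides a bigger instance is the cluster's Spark config. A rough sketch of settings one might try; the values here are guesses rather than recommendations, and on Databricks the driver heap normally tracks the driver node type, so a larger driver node is often the simpler fix:

spark.driver.memory 24g
spark.driver.maxResultSize 8g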
07-06-2018 07:21 AM
I ran into the same problem while writing a table of 100 columns and 2M rows to S3. I have tried all of the largest driver types available, but the problem persists.
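For what it's worth, one thing that sometimes helps with wide writes like this is controlling the partitioning before the write, so no single task has to buffer too much at once. A rough sketch; the DataFrame name, partition count, and S3 path are all made up:

# Repartition before writing so each output file/task stays a manageable size.
# df, the partition count, and the path are illustrative placeholders.
(df.repartition(200)
   .write
   .mode("overwrite")
   .parquet("s3://my-bucket/wide_table/"))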
03-27-2019 08:02 AM
I've been unable to find any background on this issue. After digging into the spark logs, I've also found a reference to a GC issue. More specifically:
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    ...
I should note this is a simple object declaration; no data is being processed by the culprit cell.
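For anyone hitting the same thing, it can help to turn on GC logging for the driver so you can see whether the heap is already nearly exhausted before the "culprit" cell even runs. A sketch of what one might add to the cluster's Spark config (these are standard Java 8 GC flags; adjust for your JVM):

spark.driver.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps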
06-07-2021 06:33 PM
Most of the time this is an out-of-memory condition on the driver node. Check the driver logs and the worker logs in the Spark UI, and check whether you are collecting a huge amount of data to the driver node, e.g. with collect().
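To illustrate, a contrived sketch of the pattern to avoid and a safer alternative (the table name and output path are made up):

# Risky: collect() materializes every row on the driver and can OOM it.
all_rows = spark.table("events").collect()

# Safer: keep the work distributed and only bring back what you really need.
spark.table("events").write.mode("overwrite").parquet("/tmp/events_out")
preview = spark.table("events").limit(100).collect()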