01-22-2018 06:55 AM
No other output is available, not even output from cells that did run successfully.
Also, I'm unable to connect to the Spark UI or view the logs. It attempts to load each of them, but after some time an error message appears saying it's unable to load.
This happens on a job that runs every Sunday. I've tried varying the cluster configuration (spot vs. on-demand, number of instances, instance types, etc.) and nothing seems to fix it. The job runs two notebooks via the dbutils.notebook.run utility, and I am able to run each notebook as its own job without problems; it only fails when they are run together.
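For context, the orchestration is roughly the following (the notebook paths and the timeout are placeholders, not the real values):

# Parent notebook: runs the two child notebooks one after the other.
# The paths and the timeout (in seconds) are illustrative placeholders.
result_a = dbutils.notebook.run("/jobs/notebook_a", 7200)
result_b = dbutils.notebook.run("/jobs/notebook_b", 7200)
print(result_a, result_b)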
Any suggestions for figuring out what's going on? At this point, I'm thinking of breaking this up into two jobs and trying to stagger them far enough apart that the first is sure to finish before the second starts.
01-24-2018 06:25 AM
I'm not sure if this is related, but when I ran this job yesterday, the spot where it fails on the weekly run never completed this time. The job typically takes about 2 hours, but I noticed it was still running at 14 hours. I canceled the job before looking at the Spark UI/logs, and now that the job has finished in a failed state, I am unable to load the Spark UI or view the logs, with the same error message as above.
01-24-2018 08:16 AM
It seems like something about how this job fails prevents Databricks from restoring the logs or the Spark UI. I remember something like this happening in the past, and it was related to a job outputting UTF-8 characters. I believe Databricks fixed that issue, and this job shouldn't trigger it anyway: it's a counting job, and all text is pre-sanitized to contain only ASCII or numeric IDs.
01-25-2018 08:30 AM
I don't have proof of this, but I suspect this is simply a memory issue where Spark either hangs (presumably stuck in GC) or is killed by the OS. Watching the logs as the job progressed, I noticed that GC cycles were happening more frequently as it approached the point where the job typically dies or hangs. I re-ran the job on a larger instance size and it went right past where it had died/hung in the past. Of course, the real test will be running this notebook as part of the larger job it typically runs with.
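If driver memory pressure really is the cause, another lever besides a bigger instance is the cluster's Spark config. A rough sketch of settings one might try; the values here are guesses rather than recommendations, and on Databricks the driver heap normally tracks the driver node type, so a larger driver node is often the simpler fix:

spark.driver.memory 24g
spark.driver.maxResultSize 8g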
07-06-2018 07:21 AM
I ran into the same problem while writing a table of 100 columns and 2M rows to S3. I have tried all of the largest driver types available, but the problem persists.
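For what it's worth, one thing that sometimes helps with wide writes like this is controlling the partitioning before the write, so no single task has to buffer too much at once. A rough sketch; the DataFrame name, partition count, and S3 path are all made up:

# Repartition before writing so each output file/task stays a manageable size.
# df, the partition count, and the path are illustrative placeholders.
(df.repartition(200)
   .write
   .mode("overwrite")
   .parquet("s3://my-bucket/wide_table/"))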
03-27-2019 08:02 AM
I've been unable to find any background on this issue. After digging into the spark logs, I've also found a reference to a GC issue. More specifically:
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    ...
I should note this is a simple object declaration; no data is being processed by the culprit cell.
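For anyone hitting the same thing, it can help to turn on GC logging for the driver so you can see whether the heap is already nearly exhausted before the "culprit" cell even runs. A sketch of what one might add to the cluster's Spark config (these are standard Java 8 GC flags; adjust for your JVM):

spark.driver.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps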
06-07-2021 06:33 PM
Most of the time this is an out-of-memory condition on the driver node. Check the driver logs and the worker logs in the Spark UI, and check whether you are collecting a huge amount of data to the driver node, e.g. with collect().
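To illustrate, a contrived sketch of the pattern to avoid and a safer alternative (the table name and output path are made up):

# Risky: collect() materializes every row on the driver and can OOM it.
all_rows = spark.table("events").collect()

# Safer: keep the work distributed and only bring back what you really need.
spark.table("events").write.mode("overwrite").parquet("/tmp/events_out")
preview = spark.table("events").limit(100).collect()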