Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Job fails with "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."

JustinMills
New Contributor III

No other output is available, not even output from cells that did run successfully.

Also, I'm unable to connect to the Spark UI or view the logs. It makes an attempt to load each of them, but after some time an error message appears saying it's unable to load.

This happens on a job that runs every Sunday. I've tried varying the cluster configuration (spot vs. on-demand instances, number of instances, instance types, etc.) and nothing seems to fix it. The job runs two notebooks via the dbutils.notebook.run utility, and I am able to run each notebook independently as its own job; the failure only occurs when they are run together.
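For reference, the two-notebook orchestration described above can be sketched as below. The notebook paths and timeout are hypothetical placeholders, and dbutils.notebook.run only exists inside a Databricks notebook, so the runner function is injected here to keep the sketch self-contained:

```python
# Hedged sketch of running two notebooks back to back, as the job does.
# The notebook paths and timeout below are hypothetical placeholders.
# dbutils.notebook.run is only available inside a Databricks notebook,
# so the runner is passed in as a function; on Databricks you would pass
# dbutils.notebook.run itself.

def run_pipeline(run_notebook, timeout_seconds=7200):
    """Run the two notebooks sequentially and return their exit values."""
    results = []
    for path in ["/Jobs/notebook_one", "/Jobs/notebook_two"]:  # hypothetical
        # run_notebook blocks until the child notebook finishes, so the
        # second notebook only starts after the first completes.
        results.append(run_notebook(path, timeout_seconds))
    return results

# Outside Databricks, exercise the flow with a stub runner:
demo_results = run_pipeline(lambda path, t: f"ran {path}", timeout_seconds=60)
```

On Databricks the call would be run_pipeline(dbutils.notebook.run); because each call blocks until the child notebook returns, the two notebooks never run concurrently, which matters when considering driver memory pressure.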

Any suggestions for figuring out what's going on? At this point, I'm thinking of breaking this up into two jobs and trying to stagger them far enough apart that the first is sure to finish before the second starts.

1 ACCEPTED SOLUTION


JustinMills
New Contributor III

I don't have proof of this, but I suspect this is just a memory issue where Spark either hangs (presumably stuck in GC) or is killed by the OS. I watched the logs as the job progressed and noticed that GC cycles were happening more frequently as it approached the point where the job typically died or hung. I re-ran the job using a larger instance size and it went right past where it had died or hung in the past. Of course, the real test will be running this notebook as part of the larger job it typically runs with.
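For anyone wanting to confirm the GC theory on their own cluster before resizing, verbose GC logging can be surfaced in the driver log through the driver's JVM options. A minimal cluster spark_conf fragment (assuming a Java 8 driver, where these flags apply) might look like:

```json
{
  "spark_conf": {
    "spark.driver.extraJavaOptions": "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
  }
}
```

With this set, each GC pause is timestamped in the driver's stdout log, so increasingly frequent full GCs near the failure point become visible directly rather than inferred.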


6 REPLIES

JustinMills
New Contributor III

I'm not sure if this is related, but I ran this job yesterday, and this time it never completed at the exact spot where it had failed on the weekly run. It typically takes about two hours, but I noticed it was still running at 14 hours. I canceled the job before looking at the Spark UI/logs, and now that the job has finished in a failed state, I am unable to load the Spark UI or view the logs, with the same error message as above.

JustinMills
New Contributor III

Seems like maybe there's something about how this job fails that circumvents Databricks' ability to restore the logs or UI. I remember something like this happening in the past; it was related to a job outputting UTF-8 characters, and I think Databricks fixed that issue. This job should not do that, as it's a counting job and all text is pre-sanitized to contain only ASCII or numeric IDs.


Jingking
New Contributor II

I ran into the same problem while writing a table of 100 columns and 2M rows into S3. I have tried the largest available driver instance types, but the problem persists.

I've been unable to find any background on this issue. After digging into the Spark logs, I've also found a reference to a GC issue. More specifically:

java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    ...

I should note that the culprit cell is a simple object declaration; no data is being processed in it.

lzlkni
New Contributor II

Most of the time this is an out-of-memory condition on the driver node. Check the full driver log, and the executor logs in the Spark UI.

Also check whether you are pulling a huge amount of data onto the driver node, e.g. via collect().
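To illustrate the point about collect(): Spark's collect() materializes every row of a distributed dataset on the driver, while take(n) bounds what reaches the driver to n rows. A pure-Python analogy (no Spark session needed, since one isn't available outside a cluster; the dataset size here is just illustrative):

```python
# Pure-Python analogy of Spark's driver-side materialization.
# collect() pulls ALL rows to the driver; take(n) pulls at most n.
from itertools import islice

def take(rows, n):
    # Analogue of Spark's take(n): materialize at most n rows on the driver.
    return list(islice(rows, n))

big_dataset = range(10_000_000)       # stands in for a large distributed dataset
preview = take(iter(big_dataset), 5)  # bounded driver memory: only 5 rows
# list(big_dataset) would be the collect() analogue: all 10M rows on the driver
```

When you only need to inspect results, preferring take(), limit(), or writing out to storage keeps the driver's heap out of the failure path described in this thread.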
