GCP Cluster will not boot correctly with Libraries preconfigured - notebooks never attach

rjurnitos
New Contributor II

I am running Databricks 15.4 LTS on a single-node `n1-highmem-32` for a PySpark / GraphFrames app (we are not using the built-in `graphframes` on the ML image because we don't need a GPU). I can start the cluster fine as long as no libraries are attached. I then configure libraries: GraphFrames via Spark Packages using the Maven UI, plus our package's `.whl` and a `requirements.txt` that I have uploaded to a volume. Everything works fine: I can use the cluster, run `from graphframes import GraphFrame`, and all is well.
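For concreteness, our library setup is roughly equivalent to the following Libraries API payload (the GraphFrames version and the volume paths here are illustrative, not our exact values):

```python
# Roughly our cluster library configuration, expressed as a Libraries API
# (POST /api/2.0/libraries/install) payload. The Maven coordinate version
# and the /Volumes paths are illustrative placeholders.
libraries_payload = {
    "cluster_id": "<cluster-id>",
    "libraries": [
        {
            "maven": {
                "coordinates": "graphframes:graphframes:0.8.3-spark3.5-s_2.12",
                "repo": "https://repos.spark-packages.org",
            }
        },
        {"whl": "/Volumes/<catalog>/<schema>/libs/our_package-0.1.0-py3-none-any.whl"},
        {"requirements": "/Volumes/<catalog>/<schema>/libs/requirements.txt"},
    ],
}
```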

Then I stop the cluster. The Libraries are still configured as seen below.

[Screenshot: cluster Libraries tab showing the installed libraries]

Now I boot the cluster again. The cluster says it is done booting, and the libraries spinner shows complete. I try to attach and run a notebook... it sits there forever and never attaches. Eventually this exception appears:

Failure starting repl. Try detaching and re-attaching the notebook. at com.databricks.spark.chauffeur.ExecContextState.processInternalMessage(ExecContextState.scala:347) at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)

This is a blocker for us, and seems like a bug.

What should I do about this? I am stuck. I can't automate this in a workflow because of this bug that requires manual intervention. We don't have Databricks support at this point, so I am here asking questions 🙂

2 REPLIES

rjurnitos
New Contributor II

Bump... anyone?

mark_ott
Databricks Employee

It sounds like you are encountering a cluster "hang" / notebook-attach timeout after restarting a Databricks 15.4 LTS single-node cluster with custom libraries (GraphFrames via Maven plus additional .whl and requirements.txt dependencies). Your configuration works on a fresh start, but after a restart, notebooks fail to attach: the spinner persists and eventually this error appears:

Failure starting repl. Try detaching and re-attaching the notebook. at com.databricks.spark.chauffeur.ExecContextState.processInternalMessage...

Below are specific steps and mitigations you can try, plus direct advice that should allow you to either stabilize your workflow or gather evidence for a deeper investigation.


Possible Causes

  • Library conflicts: Custom .whl files or requirements.txt entries may pull packages that conflict with Databricks system dependencies, especially after a cluster restart, due to library isolation and dependency-resolution order (a quick version-check sketch follows this list).

  • Spark driver initialization hang: Your libraries may trigger code or resource loading that deadlocks Spark's (or Python's) driver environment, especially if dependencies or initialization logic have side effects or network calls.

  • Init script effects: Implicit or explicit init scripts (including ones you may not be aware are running) can alter library paths or the environment.

  • Stuck processes/ports: After restart, orphaned processes or locked ports could block the REPL startup.
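On the library-conflict theory, here is a minimal sketch for comparing what the runtime actually has installed against what your requirements.txt pins (the package pins below are hypothetical):

```python
# Minimal sketch: compare installed package versions against the pins in
# your requirements.txt. The pinned versions here are hypothetical.
from importlib.metadata import version, PackageNotFoundError

pinned = {"numpy": "1.23.5", "pandas": "1.5.3"}  # hypothetical pins

for pkg, want in pinned.items():
    try:
        have = version(pkg)
        status = "OK" if have == want else f"CONFLICT (runtime has {have})"
    except PackageNotFoundError:
        status = "NOT INSTALLED"
    print(f"{pkg}=={want}: {status}")
```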


Recommended Troubleshooting Steps

1. Check the Driver and Init Script Logs

  • Go to your cluster, select "Driver Logs", and review the output during library install and notebook attach (see the sketch after this step).

  • Search for errors around pip, Maven, and JAR/egg loading, and any exceptions in the "driver" / "eventlog" files.

  • If you use cluster-scoped init scripts, ensure they aren't hanging on network calls, package installs, etc.
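If cluster log delivery is enabled on the cluster (the DBFS destination below is an assumed placeholder), you can inspect the delivered driver logs from a separate, healthy cluster:

```python
# Sketch, assuming cluster log delivery is configured to a DBFS destination.
# Delivered driver logs land under <destination>/<cluster-id>/driver/;
# exact file names may vary. Run this from a notebook on a healthy cluster,
# where `dbutils` is available.
log_root = "dbfs:/cluster-logs/<cluster-id>/driver"  # placeholder destination

for entry in dbutils.fs.ls(log_root):
    print(entry.name, entry.size)

# Pull the head of the main driver log to look for pip/Maven/REPL errors.
print(dbutils.fs.head(f"{log_root}/log4j-active.log", 65536))
```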

2. Try Library Isolation and Reordering

  • Remove all libraries, restart the cluster, then re-add them one at a time to isolate which library, if any, causes the deadlock (one way to script this is sketched after this step).

  • Stick to a single installation mechanism where possible (for example, cluster-scoped libraries) rather than mixing cluster-scoped installs, notebook-scoped installs, and global installation mechanisms.
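A sketch of scripting the one-at-a-time isolation with the Libraries REST API; the host, token, cluster ID, coordinate, and path below are placeholders:

```python
# Sketch: install libraries one at a time via the Libraries REST API and
# poll status between installs. HOST, TOKEN, and CLUSTER_ID are placeholders.
import time
import requests

HOST = "https://<workspace-url>"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}
CLUSTER_ID = "<cluster-id>"

candidates = [
    {"maven": {"coordinates": "graphframes:graphframes:0.8.3-spark3.5-s_2.12"}},
    {"whl": "/Volumes/<catalog>/<schema>/libs/our_package-0.1.0-py3-none-any.whl"},
]

for lib in candidates:
    requests.post(f"{HOST}/api/2.0/libraries/install", headers=HEADERS,
                  json={"cluster_id": CLUSTER_ID, "libraries": [lib]}).raise_for_status()
    # Poll until every library settles before adding the next one.
    while True:
        resp = requests.get(f"{HOST}/api/2.0/libraries/cluster-status",
                            headers=HEADERS,
                            params={"cluster_id": CLUSTER_ID}).json()
        states = {s["status"] for s in resp.get("library_statuses", [])}
        if states <= {"INSTALLED", "FAILED", "SKIPPED"}:
            print(lib, "->", states)
            break
        time.sleep(15)
```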

3. Restart with Cleared State

  • Rather than a simple restart, clear as much state as you can: detach the notebook, clear its state ("Clear state and outputs" in the notebook's Run menu), and then restart; terminating and starting the cluster clears the VM state entirely.

  • If this fixes the attach issue, it points to an orphaned process or a corrupted library cache.

4. Use a Clean VM Image

  • If possible, switch the cluster node type or re-deploy the cluster from scratch. Sometimes, VM image cache or opaque environment bugs will persist across restarts but not on a fresh VM.

5. Minimal "Safe" Library Install

  • Attach only the GraphFrames Maven package first. If that works, add your .whl and requirements.txt files incrementally, smoke-testing after each addition (see the snippet after this step).

  • If the hang appears only after the custom .whl/requirements.txt step, examine that package for complex install or dependency logic (especially if it compiles C extensions, spawns subprocesses, or runs install-time scripts).
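A one-cell smoke test to run after each incremental install; if the notebook attaches and these imports resolve, the most recently added library is probably not the culprit (`our_package` is an illustrative name):

```python
# Smoke test to run after each incremental library install.
from graphframes import GraphFrame  # from the Maven / Spark Packages install

import our_package                  # your custom wheel; the name is illustrative

print("REPL attached; imports resolved")
```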

6. Consider a "Restartless" Workflow

  • If this occurs only after a restart, and not on a fresh cluster, you may be able to work around it by always terminating and starting (never restarting), or by automating a notebook that reattaches/initializes as a post-start step (a scripted terminate-and-start is sketched below).
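A sketch of scripting terminate-and-start with the Clusters REST API (again, the host, token, and cluster ID are placeholders):

```python
# Sketch: terminate, wait for TERMINATED, then start fresh, instead of
# calling /clusters/restart. HOST, TOKEN, and CLUSTER_ID are placeholders.
import time
import requests

HOST = "https://<workspace-url>"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}
CLUSTER_ID = "<cluster-id>"

# /clusters/delete terminates the cluster but keeps its configuration.
requests.post(f"{HOST}/api/2.0/clusters/delete", headers=HEADERS,
              json={"cluster_id": CLUSTER_ID}).raise_for_status()

# Wait until the cluster reports TERMINATED before starting it again.
while True:
    state = requests.get(f"{HOST}/api/2.0/clusters/get", headers=HEADERS,
                         params={"cluster_id": CLUSTER_ID}).json()["state"]
    if state == "TERMINATED":
        break
    time.sleep(10)

requests.post(f"{HOST}/api/2.0/clusters/start", headers=HEADERS,
              json={"cluster_id": CLUSTER_ID}).raise_for_status()
```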


Long-Term Workarounds & Automation

  • Use cluster init scripts for library installation, ensuring these always complete quickly and log output for debugging.

  • Automate a "detach and reattach" (or a REPL health check) as part of your workflow steps if you cannot find a root-cause fix (a sketch follows this list).

  • Keep library dependencies minimal and avoid system-wide Python or Java package overwrites unless essential.

  • Consider using Databricks Repos and workspace-installed libraries rather than cluster-scoped installation, for stability.
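For the "automate detach and reattach" idea, one pre-flight option is to exercise a fresh execution context through the legacy Command Execution (1.2) API before a workflow proceeds; if this hangs, it reproduces the attach failure without a notebook. Placeholders as above:

```python
# Sketch: verify the cluster can actually serve a Python REPL by creating an
# execution context, running a trivial command, and destroying the context.
# Uses the legacy Command Execution (1.2) API; all IDs are placeholders.
import time
import requests

HOST = "https://<workspace-url>"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}
CLUSTER_ID = "<cluster-id>"

ctx = requests.post(f"{HOST}/api/1.2/contexts/create", headers=HEADERS,
                    json={"clusterId": CLUSTER_ID, "language": "python"}).json()["id"]

cmd = requests.post(f"{HOST}/api/1.2/commands/execute", headers=HEADERS,
                    json={"clusterId": CLUSTER_ID, "contextId": ctx,
                          "language": "python", "command": "print(1 + 1)"}).json()["id"]

# Poll until the command finishes; a hang here mirrors the notebook symptom.
while True:
    result = requests.get(f"{HOST}/api/1.2/commands/status", headers=HEADERS,
                          params={"clusterId": CLUSTER_ID, "contextId": ctx,
                                  "commandId": cmd}).json()
    if result["status"] in ("Finished", "Error", "Cancelled"):
        print(result["status"], result.get("results"))
        break
    time.sleep(5)

requests.post(f"{HOST}/api/1.2/contexts/destroy", headers=HEADERS,
              json={"clusterId": CLUSTER_ID, "contextId": ctx}).raise_for_status()
```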


What to Collect If Reporting Further

  • Driver and Executor logs after restart, especially errors around package load time.

  • Content and installation scripts of your custom .whl and requirements.txt, to isolate environmental issues.

  • List of all attached libraries (Maven coordinates, custom packages, etc.).

  • Cluster configuration details (init scripts, environment variables, runtime version, node type).


This issue happens to others as well and is generally due to package conflicts or corruption of the working environment after a restart, especially with custom dependencies that are not fully compatible with the Databricks runtime's pre-installed packages. Avoiding restarts (always terminating and starting fresh), or re-adding libraries one at a time to find the culprit, are practical ways forward until Databricks or your engineering team can provide a fully supported resolution.