GCP Cluster will not boot correctly with Libraries preconfigured - notebooks never attach

rjurnitos
New Contributor II

I am running Databricks 15.4 LTS on a single-node `n1-highmem-32` for a PySpark / GraphFrames app (we are not using the built-in `graphframes` on the ML image because we don't need a GPU). I can start the cluster fine as long as no libraries are attached. I then configure libraries: GraphFrames via Spark Packages using the Maven UI, plus our package's `.whl` and a `requirements.txt` that I have uploaded to a volume. Everything works fine: I can use the cluster, run `from graphframes import GraphFrame`, and all is well.
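For concreteness, our library setup is roughly equivalent to the following Libraries API payload (the GraphFrames version and the volume paths here are illustrative, not our exact values):

```python
# Roughly our cluster library configuration, expressed as a Libraries API
# (POST /api/2.0/libraries/install) payload. The Maven coordinate version
# and the /Volumes paths are illustrative placeholders.
libraries_payload = {
    "cluster_id": "<cluster-id>",
    "libraries": [
        {
            "maven": {
                "coordinates": "graphframes:graphframes:0.8.3-spark3.5-s_2.12",
                "repo": "https://repos.spark-packages.org",
            }
        },
        {"whl": "/Volumes/<catalog>/<schema>/libs/our_package-0.1.0-py3-none-any.whl"},
        {"requirements": "/Volumes/<catalog>/<schema>/libs/requirements.txt"},
    ],
}
```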

Then I stop the cluster. The Libraries are still configured as seen below.

[Screenshot: cluster Libraries tab showing the installed libraries]

Now I boot the cluster again. The cluster says it is done booting, and the libraries spinner shows complete. I try to attach and run a notebook... it sits there forever and never attaches. Eventually this exception appears:

Failure starting repl. Try detaching and re-attaching the notebook. at com.databricks.spark.chauffeur.ExecContextState.processInternalMessage(ExecContextState.scala:347) at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)

This is a blocker for us, and seems like a bug.

What should I do about this? I am stuck. I can't automate this in a workflow because of this bug that requires manual intervention. We don't have Databricks support at this point, so I am here asking questions 🙂

2 REPLIES

rjurnitos
New Contributor II

Bump... anyone?

mark_ott
Databricks Employee

It sounds like you are encountering a cluster "hang" / notebook-attach timeout after restarting a Databricks 15.4 LTS single-node cluster with custom libraries (GraphFrames via Maven plus additional .whl and requirements.txt dependencies). Your configuration works on a fresh start, but after a restart, notebooks fail to attach: the spinner persists and eventually this error appears:

Failure starting repl. Try detaching and re-attaching the notebook. at com.databricks.spark.chauffeur.ExecContextState.processInternalMessage...

Below are specific steps and mitigations you can try, plus direct advice that should allow you to either stabilize your workflow or gather evidence for a deeper investigation.


Possible Causes

  • Library conflicts: Custom .whl files or requirements.txt entries may pull packages that conflict with Databricks system dependencies, especially after a cluster restart, due to library isolation and dependency-resolution order (a quick version-check sketch follows this list).

  • Spark driver initialization hang: Your libraries may trigger code or resource loading that deadlocks Spark's (or Python's) driver environment, especially if dependencies or initialization logic have side effects or network calls.

  • Init script effects: Implicit or explicit init scripts (including ones you may not be aware are running) can alter library paths or the environment.

  • Stuck processes/ports: After restart, orphaned processes or locked ports could block the REPL startup.
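On the library-conflict theory, here is a minimal sketch for comparing what the runtime actually has installed against what your requirements.txt pins (the package pins below are hypothetical):

```python
# Minimal sketch: compare installed package versions against the pins in
# your requirements.txt. The pinned versions here are hypothetical.
from importlib.metadata import version, PackageNotFoundError

pinned = {"numpy": "1.23.5", "pandas": "1.5.3"}  # hypothetical pins

for pkg, want in pinned.items():
    try:
        have = version(pkg)
        status = "OK" if have == want else f"CONFLICT (runtime has {have})"
    except PackageNotFoundError:
        status = "NOT INSTALLED"
    print(f"{pkg}=={want}: {status}")
```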


Recommended Troubleshooting Steps

1. Check the Driver and Init Script Logs

  • Go to your cluster, select "Driver Logs", and review the output during library install and notebook attach (see the sketch after this step).

  • Search for errors around pip, Maven, and JAR/egg loading, and any exceptions in the "driver" / "eventlog" files.

  • If you use cluster-scoped init scripts, ensure they aren't hanging on network calls, package installs, etc.
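If cluster log delivery is enabled on the cluster (the DBFS destination below is an assumed placeholder), you can inspect the delivered driver logs from a separate, healthy cluster:

```python
# Sketch, assuming cluster log delivery is configured to a DBFS destination.
# Delivered driver logs land under <destination>/<cluster-id>/driver/;
# exact file names may vary. Run this from a notebook on a healthy cluster,
# where `dbutils` is available.
log_root = "dbfs:/cluster-logs/<cluster-id>/driver"  # placeholder destination

for entry in dbutils.fs.ls(log_root):
    print(entry.name, entry.size)

# Pull the head of the main driver log to look for pip/Maven/REPL errors.
print(dbutils.fs.head(f"{log_root}/log4j-active.log", 65536))
```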

2. Try Library Isolation and Reordering

  • Remove all libraries, restart the cluster, then re-add them one at a time to isolate which library, if any, causes the deadlock (one way to script this is sketched after this step).

  • Stick to a single installation mechanism where possible (for example, cluster-scoped libraries) rather than mixing cluster-scoped installs, notebook-scoped installs, and global installation mechanisms.
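A sketch of scripting the one-at-a-time isolation with the Libraries REST API; the host, token, cluster ID, coordinate, and path below are placeholders:

```python
# Sketch: install libraries one at a time via the Libraries REST API and
# poll status between installs. HOST, TOKEN, and CLUSTER_ID are placeholders.
import time
import requests

HOST = "https://<workspace-url>"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}
CLUSTER_ID = "<cluster-id>"

candidates = [
    {"maven": {"coordinates": "graphframes:graphframes:0.8.3-spark3.5-s_2.12"}},
    {"whl": "/Volumes/<catalog>/<schema>/libs/our_package-0.1.0-py3-none-any.whl"},
]

for lib in candidates:
    requests.post(f"{HOST}/api/2.0/libraries/install", headers=HEADERS,
                  json={"cluster_id": CLUSTER_ID, "libraries": [lib]}).raise_for_status()
    # Poll until every library settles before adding the next one.
    while True:
        resp = requests.get(f"{HOST}/api/2.0/libraries/cluster-status",
                            headers=HEADERS,
                            params={"cluster_id": CLUSTER_ID}).json()
        states = {s["status"] for s in resp.get("library_statuses", [])}
        if states <= {"INSTALLED", "FAILED", "SKIPPED"}:
            print(lib, "->", states)
            break
        time.sleep(15)
```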

3. Restart with Cleared State

  • Rather than a simple restart, clear as much state as you can: detach the notebook, clear its state ("Clear state and outputs" in the notebook's Run menu), and then restart; terminating and starting the cluster clears the VM state entirely.

  • If this fixes the attach issue, it points to an orphaned process or a corrupted library cache.

4. Use a Clean VM Image

  • If possible, switch the cluster node type or re-deploy the cluster from scratch. Sometimes, VM image cache or opaque environment bugs will persist across restarts but not on a fresh VM.

5. Minimal "Safe" Library Install

  • Attach only the GraphFrames Maven package first. If that works, add your .whl and requirements.txt files incrementally, smoke-testing after each addition (see the snippet after this step).

  • If the hang appears only after the custom .whl/requirements.txt step, examine that package for complex install or dependency logic (especially if it compiles C extensions, spawns subprocesses, or runs install-time scripts).
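A one-cell smoke test to run after each incremental install; if the notebook attaches and these imports resolve, the most recently added library is probably not the culprit (`our_package` is an illustrative name):

```python
# Smoke test to run after each incremental library install.
from graphframes import GraphFrame  # from the Maven / Spark Packages install

import our_package                  # your custom wheel; the name is illustrative

print("REPL attached; imports resolved")
```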

6. Consider a "Restartless" Workflow

  • If this occurs only after a restart, and not on a fresh cluster, you may be able to work around it by always terminating and starting (never restarting), or by automating a notebook that reattaches/initializes as a post-start step (a scripted terminate-and-start is sketched below).
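A sketch of scripting terminate-and-start with the Clusters REST API (again, the host, token, and cluster ID are placeholders):

```python
# Sketch: terminate, wait for TERMINATED, then start fresh, instead of
# calling /clusters/restart. HOST, TOKEN, and CLUSTER_ID are placeholders.
import time
import requests

HOST = "https://<workspace-url>"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}
CLUSTER_ID = "<cluster-id>"

# /clusters/delete terminates the cluster but keeps its configuration.
requests.post(f"{HOST}/api/2.0/clusters/delete", headers=HEADERS,
              json={"cluster_id": CLUSTER_ID}).raise_for_status()

# Wait until the cluster reports TERMINATED before starting it again.
while True:
    state = requests.get(f"{HOST}/api/2.0/clusters/get", headers=HEADERS,
                         params={"cluster_id": CLUSTER_ID}).json()["state"]
    if state == "TERMINATED":
        break
    time.sleep(10)

requests.post(f"{HOST}/api/2.0/clusters/start", headers=HEADERS,
              json={"cluster_id": CLUSTER_ID}).raise_for_status()
```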


Long-Term Workarounds & Automation

  • Use cluster init scripts for library installation, ensuring these always complete quickly and log output for debugging.

  • Automate a "detach and reattach" (or a REPL health check) as part of your workflow steps if you cannot find a root-cause fix (a sketch follows this list).

  • Keep library dependencies minimal and avoid system-wide Python or Java package overwrites unless essential.

  • Consider using Databricks Repos and workspace-installed libraries rather than cluster-scoped installation, for stability.
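For the "automate detach and reattach" idea, one pre-flight option is to exercise a fresh execution context through the legacy Command Execution (1.2) API before a workflow proceeds; if this hangs, it reproduces the attach failure without a notebook. Placeholders as above:

```python
# Sketch: verify the cluster can actually serve a Python REPL by creating an
# execution context, running a trivial command, and destroying the context.
# Uses the legacy Command Execution (1.2) API; all IDs are placeholders.
import time
import requests

HOST = "https://<workspace-url>"
HEADERS = {"Authorization": "Bearer <personal-access-token>"}
CLUSTER_ID = "<cluster-id>"

ctx = requests.post(f"{HOST}/api/1.2/contexts/create", headers=HEADERS,
                    json={"clusterId": CLUSTER_ID, "language": "python"}).json()["id"]

cmd = requests.post(f"{HOST}/api/1.2/commands/execute", headers=HEADERS,
                    json={"clusterId": CLUSTER_ID, "contextId": ctx,
                          "language": "python", "command": "print(1 + 1)"}).json()["id"]

# Poll until the command finishes; a hang here mirrors the notebook symptom.
while True:
    result = requests.get(f"{HOST}/api/1.2/commands/status", headers=HEADERS,
                          params={"clusterId": CLUSTER_ID, "contextId": ctx,
                                  "commandId": cmd}).json()
    if result["status"] in ("Finished", "Error", "Cancelled"):
        print(result["status"], result.get("results"))
        break
    time.sleep(5)

requests.post(f"{HOST}/api/1.2/contexts/destroy", headers=HEADERS,
              json={"clusterId": CLUSTER_ID, "contextId": ctx}).raise_for_status()
```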


What to Collect If Reporting Further

  • Driver and Executor logs after restart, especially errors around package load time.

  • Content and installation scripts of your custom .whl and requirements.txt, to isolate environmental issues.

  • List of all attached libraries (Maven coordinates, custom packages, etc.).

  • Cluster configuration details (init scripts, environment variables, runtime version, node type).


This issue happens to others as well and is generally due to package conflicts or corruption of the working environment after a restart, especially with custom dependencies that are not fully compatible with the Databricks runtime's pre-installed packages. Avoiding restarts (always terminating and starting fresh), or re-adding libraries one at a time to find the culprit, are practical ways forward until Databricks or your engineering team can provide a fully supported resolution.