Shorten classic cluster startup time
2 weeks ago
We use R notebooks to generate our workflows, so we have to use classic clusters. We need roughly 10 additional R packages plus 2 PyPI packages, and the cluster takes at least 10-20 minutes to start. We found that most of that time is spent on package installation. I tried to pre-install the packages to a volume:
Is there any way to shorten the cluster startup time? Thank you.
2 weeks ago
The reason your Volume-based cache isn't working is a credential scoping issue. Databricks only injects UC Volume credentials into init scripts that are themselves stored on a UC Volume. If your init script lives in workspace files or cloud storage, it can't actually read from /Volumes/datalake/test/rlib_cache at execution time — even though the path looks fine and your R code works in a running notebook.
The fix: move your init script to the same Volume (e.g., /Volumes/datalake/test/scripts/init_r.sh). But I'd also change the strategy slightly. Instead of pointing .libPaths() at the Volume path, copy the packages to a local directory during init. Reading libraries over the FUSE mount adds noticeable latency on every library() call.
#!/bin/bash
# Copy the pre-built R packages from the Volume into the local site
# library, so library() reads from local disk rather than the FUSE mount.
cp -R /Volumes/datalake/test/rlib_cache/* /usr/local/lib/R/site-library/ 2>/dev/null
That copy takes maybe 10-30 seconds for your package set. Way better than 20 minutes of compilation.
Hope this helps! If it does, please mark it as a solution.
2 weeks ago
Hey @NW1000 — good question, and your instinct to pre-compile and cache is the right one. There are three separate things working against you here, and fixing all three should collapse that 10-20 minute startup significantly.
- You're pulling source builds, not binaries.
Your PPM URL points at __linux__/noble/..., but not every Databricks Runtime runs that Ubuntu release: 14.x and 15.x are on Ubuntu 22.04 (jammy), while newer runtimes have moved to 24.04 (noble). When PPM can't match the distro, it silently falls back to compiling from source, and that's almost certainly where most of your time is going. Match the URL to your runtime's codename; for jammy:
repos <- c(CRAN = "https://packagemanager.posit.co/cran/__linux__/jammy/latest")
Precompiled binaries should cut install time from minutes to seconds per package.
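If you are not sure which codename your runtime uses, one quick way to check (a sketch, assuming an Ubuntu-based image with the standard /etc/os-release file) is:

```shell
#!/bin/bash
# Read the Ubuntu codename (e.g. jammy or noble) from /etc/os-release
# and print the matching PPM URL following the pattern above.
codename=$(. /etc/os-release && echo "$VERSION_CODENAME")
echo "https://packagemanager.posit.co/cran/__linux__/${codename}/latest"
```

Run it in a %sh notebook cell on the cluster; the printed URL is what repos should point at.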
- Direct install to a Volume path isn't reliable.
install.packages(..., lib = "/Volumes/...") doesn't behave consistently across driver and workers. The documented pattern is two steps — install to the default library location first, then copy to the Volume:
# Run once on a "build" cluster with the same DBR you'll use in production
pkgs <- c("mmrm","emmeans","striprtf","pandoc",
"glmmTMB","kableExtra","rtables",
"tinytex","tern")
repos <- c(CRAN = "https://packagemanager.posit.co/cran/__linux__/jammy/latest")
install.packages(pkgs, repos = repos)
# Then copy to Volume
volume_pkgs <- "/Volumes/<catalog>/<schema>/<volume>/r_libs"
dir.create(volume_pkgs, recursive = TRUE, showWarnings = FALSE)
sapply(pkgs, function(p) {
file.copy(from = find.package(p),
to = volume_pkgs,
recursive = TRUE)
})
- Rprofile.site is the wrong config file.
Databricks controls R library paths through environment variables (R_LIBS_USER, etc.) set in /etc/R/Renviron.site. Changes in Rprofile.site can get overridden by the Databricks startup chain, which is likely why your packages didn't show up at runtime. Your init script should modify Renviron.site instead:
#!/bin/bash
set -euxo pipefail
VOLUME_PKGS="/Volumes/<catalog>/<schema>/<volume>/r_libs"
cat <<EOF >> /etc/R/Renviron.site
R_LIBS_USER=%U:/databricks/spark/R/lib:/local_disk0/.ephemeral_nfs/cluster_libraries/r:$VOLUME_PKGS
EOF
This ensures .libPaths() picks up your cached packages automatically when R starts.
Quick win for the PyPI packages: use the cluster Libraries tab directly (Compute → your cluster → Libraries → Install New → PyPI). Those get stored in the cluster-libraries location and don't need init scripts or notebooks.
Once you've made these changes, verify on a fresh R notebook:
.libPaths()
installed.packages()[, "Package"]
Confirm the Volume path shows up and your packages are found there. Also check the init script log on the driver to make sure it ran cleanly.
Worth flagging: the distro fix alone (matching the repo codename to your runtime's Ubuntu release) might solve 80% of this even without the caching layer. The caching on top of that should get your cold starts well under 2 minutes.
Hope that helps. Let us know how it goes.
Cheers, Louis.
2 weeks ago
Thanks a lot, Louis! I tried your method and saved the init script into another volume, but the cluster failed to start with this error: Init script failure: Cluster scoped init script /Volumes/datalake/test/utility_rlib_init_script/init-script-RLib.sh failed: Script exit status is non-zero. Why did it fail?
I have another question: is the default CRAN repository used by the Libraries section of a classic cluster https://databricks.packagemanager.posit.co/cran/__linux__/noble/2025-03-20/? And can we assume we can use https://databricks.packagemanager.posit.co/cran/__linux__/jammy/2025-03-20/ instead?
2 weeks ago
Hi @NW1000 ,
Glad you tried my suggestion, and thanks for sharing the details.
1. Why the init script failed
This message:
Init script failure: Cluster scoped init script ... failed: Script exit status is non-zero
really just means that something inside the bash script returned a non-zero exit code during cluster startup. In other words, the script hit an error and stopped.
The real clue will be in the init script log.
Here is where I would look:
- Open the cluster details
- Go to the Event Log or driver logs
- Find the init script log file

For the script we were discussing, it should be something like:
/tmp/init-r-libs.log
Once you open that log, scroll to the bottom and look for the first real error message. That is usually where the root cause shows up.
In most cases, it tends to be one of these:
- a typo in a path, such as the Volume path or script path
- missing execute permissions on the script, for example: chmod +x init-script-RLib.sh
- an R command inside the script failing, such as install.packages() returning an error, which will cause the whole script to exit non-zero

Once you have the last few lines from that log, it should be much easier to pinpoint exactly what failed and tighten up the script accordingly.
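One habit that makes failures like this much easier to trace (a sketch; the log path is the hypothetical one mentioned above) is to have the init script mirror its own output into that log file right at the top:

```shell
#!/bin/bash
# Exit on the first error, echo each command as it runs, and send all
# stdout/stderr to a log file that survives a failed cluster start.
set -euxo pipefail
exec >>/tmp/init-r-libs.log 2>&1

echo "init started: $(date)"
# ... package copy / Renviron.site steps would go here ...
echo "init finished OK"
```

Then the last lines of /tmp/init-r-libs.log show exactly which command returned non-zero.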
2. About the default CRAN / Posit Package Manager URL
Yes — the URL you are seeing in the Libraries UI, something like:
https://databricks.packagemanager.posit.co/cran/__linux__/noble/2025-03-20/
is the Databricks-managed Posit Package Manager snapshot used by Databricks runtimes for R packages.
A few important things to know here:
- Databricks pins R libraries to a specific CRAN snapshot, in this case 2025-03-20, so installs remain reproducible and stable
- The __linux__/<codename>/2025-03-20 portion reflects the underlying Ubuntu release, such as jammy or noble
- Databricks determines that automatically from the runtime OS for newer runtimes, including 17.x and above
- That URL is intended to be used as the repos= value in install.packages(), not as a browser-friendly page
- So if you paste it into a browser and get something like "Invalid request," that is expected behavior rather than a problem
If you want your own scripts to follow the same pattern across runtimes, the safest approach is to detect the OS codename dynamically and construct the URL from there, like this:
release <- system("lsb_release -c --short", intern = TRUE)
snapshot_date <- "2025-03-20"
options(
  HTTPUserAgent = sprintf(
    "R/%s R (%s)",
    getRversion(),
    paste(getRversion(), R.version["platform"], R.version["arch"], R.version["os"])
  ),
  repos = paste0(
    "https://databricks.packagemanager.posit.co/cran/__linux__/",
    release, "/", snapshot_date
  )
)
That way:
- if the runtime is on jammy, it uses .../__linux__/jammy/2025-03-20/
- if it is on noble, it uses .../__linux__/noble/2025-03-20/

That mirrors how Databricks handles the default CRAN configuration internally.
Hope this helps, Louis.
2 weeks ago
Hi Louis,
Thanks a lot for the great advice!
I used the 17.3 LTS ML Runtime for this classic cluster. With the code you gave, it showed "noble". Does that mean I should use 'noble' in the library installation?