Shorten classic cluster startup time

NW1000
New Contributor III

We use R notebooks to generate workflows, so we have to use classic clusters. We need roughly 10 additional R packages plus 2 PyPI packages, and the cluster takes at least 10-20 minutes to start. We found that most of that time goes to package installation. I tried pre-installing the packages to a volume:

# Run this ONCE on a running cluster, then save the library path
lib_path <- "/Volumes/datalake/test/rlib_cache"

#dir.create(lib_path, recursive = TRUE, showWarnings = FALSE)

packages <- c("mmrm", "emmeans", "striprtf", "pandoc",
              "glmmTMB", "kableExtra", "rtables",
              "tinytex", "tern")

install.packages(packages,
                 lib = lib_path,
                 HTTPUserAgent = sprintf("R/%s R (%s)", getRversion(),
                                         paste(getRversion(), R.version["platform"],
                                               R.version["arch"], R.version["os"])),
                 Ncpus = parallel::detectCores())
 
Then I set up this .sh file as an init script for the classic cluster:
#!/bin/bash
set -uo pipefail
exec > /tmp/init-r-libs.log 2>&1
echo "=== R Library Init started at $(date -u) ==="

CUSTOM_R_LIBS="/Volumes/datalake/test/rlib_cache"

# Use Rprofile.site — this runs AFTER Databricks sets up its R environment
# so the custom path will persist
cat <<EOF | sudo tee -a /usr/lib/R/etc/Rprofile.site

# --- Custom R Library Path (added by init script) ---
local({
  custom_lib <- "${CUSTOM_R_LIBS}"
  if (dir.exists(custom_lib)) {
    .libPaths(c(custom_lib, .libPaths()))
  }
})
EOF

echo "Custom R library path added to Rprofile.site: $CUSTOM_R_LIBS"
echo "=== R Library Init completed at $(date -u) ==="
 
But the cluster started without the R packages installed, so this approach failed to work.

Is there any way to shorten the cluster start up time? Thank you.

Kirankumarbs
Contributor

The reason your Volume-based cache isn't working is a credential scoping issue. Databricks only injects UC Volume credentials into init scripts that are themselves stored on a UC Volume. If your init script lives in workspace files or cloud storage, it can't actually read from /Volumes/datalake/test/rlib_cache at execution time — even though the path looks fine and your R code works in a running notebook.

The fix: move your init script to the same Volume (e.g., /Volumes/datalake/test/scripts/init_r.sh). But I'd also change the strategy slightly. Instead of pointing .libPaths() at the Volume path, copy the packages to a local directory during init. Reading libraries over the FUSE mount adds noticeable latency on every library() call.

#!/bin/bash
cp -R /Volumes/datalake/test/rlib_cache/* /usr/local/lib/R/site-library/ 2>/dev/null

 

That copy takes maybe 10-30 seconds for your package set. Way better than 20 minutes of compilation.
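Putting those pieces together, a fuller version of that init script could look like the sketch below. The paths are the ones used in this thread, and `set -euo pipefail` makes any failure show up as a non-zero exit in the cluster's init script logs. It is written here as a heredoc so you can generate and syntax-check it locally before uploading it to the Volume:

```shell
# Sketch of a fuller init script based on the copy-to-local approach above.
# The Volume and library paths are the ones from this thread; adjust as needed.
cat > /tmp/example-init.sh <<'EOF'
#!/bin/bash
set -euo pipefail

SRC="/Volumes/datalake/test/rlib_cache"
DST="/usr/local/lib/R/site-library"

# Copy the pre-built packages from the UC Volume cache to local disk so that
# library() calls read from local storage instead of the FUSE mount.
mkdir -p "$DST"
cp -R "$SRC"/. "$DST"/
echo "Copied R package cache from $SRC to $DST"
EOF

# Syntax-check the generated script without running it.
bash -n /tmp/example-init.sh && echo "init script syntax OK"
```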

Hope this helps! If it helps, mark it as a solution!

Louis_Frolio
Databricks Employee

Hey @NW1000 — good question, and your instinct to pre-compile and cache is the right one. There are three separate things working against you here, and fixing all three should collapse that 10-20 minute startup significantly.

  1. You're pulling source builds, not binaries.

Your PPM URL points at __linux__/noble/..., but Databricks Runtimes built on Ubuntu 22.04 (e.g., 14.x and 15.x) need the jammy binaries. When PPM can't match the distro, it silently falls back to compiling from source — that's almost certainly where most of your time is going. Switch to:

repos <- c(CRAN = "https://packagemanager.posit.co/cran/__linux__/jammy/latest")

Precompiled binaries should cut install time from minutes to seconds per package.

  2. Direct install to a Volume path isn't reliable.

install.packages(..., lib = "/Volumes/...") doesn't behave consistently across driver and workers. The documented pattern is two steps — install to the default library location first, then copy to the Volume:

# Run once on a "build" cluster with the same DBR you'll use in production
pkgs <- c("mmrm","emmeans","striprtf","pandoc",
          "glmmTMB","kableExtra","rtables",
          "tinytex","tern")
repos <- c(CRAN = "https://packagemanager.posit.co/cran/__linux__/jammy/latest")
install.packages(pkgs, repos = repos)

# Then copy the whole library to the Volume, so that dependencies
# installed alongside your packages come along too
volume_pkgs <- "/Volumes/<catalog>/<schema>/<volume>/r_libs"
dir.create(volume_pkgs, recursive = TRUE, showWarnings = FALSE)
file.copy(from = list.files(.libPaths()[1], full.names = TRUE),
          to   = volume_pkgs,
          recursive = TRUE)
  3. Rprofile.site is the wrong config file.

Databricks controls R library paths through environment variables (R_LIBS_USER, etc.) set in /etc/R/Renviron.site. Changes in Rprofile.site can get overridden by the Databricks startup chain, which is likely why your packages didn't show up at runtime. Your init script should modify Renviron.site instead:

#!/bin/bash
set -euxo pipefail

VOLUME_PKGS="/Volumes/<catalog>/<schema>/<volume>/r_libs"

cat <<EOF >> /etc/R/Renviron.site
R_LIBS_USER=%U:/databricks/spark/R/lib:/local_disk0/.ephemeral_nfs/cluster_libraries/r:$VOLUME_PKGS
EOF

This ensures .libPaths() picks up your cached packages automatically when R starts.

Quick win for the PyPI packages: use the cluster Libraries tab directly (Compute → your cluster → Libraries → Install New → PyPI). Those get stored in the cluster-libraries location and don't need init scripts or notebooks.

Once you've made these changes, verify on a fresh R notebook:

.libPaths()
installed.packages()[, "Package"]

Confirm the Volume path shows up and your packages are found there. Also check the init script log on the driver to make sure it ran cleanly.

Worth flagging — the distro fix alone (jammy vs. noble) might solve 80% of this even without the caching layer. The caching on top of that should get your cold starts well under 2 minutes.

Hope that helps. Let us know how it goes.

Cheers, Louis.

NW1000
New Contributor III

Thanks a lot, Louis! I tried your method and saved the init script into another volume, but the cluster failed to start with the error: Init script failure: Cluster scoped init script /Volumes/datalake/test/utility_rlib_init_script/init-script-RLib.sh failed: Script exit status is non-zero. Why did it fail?

I have another question: is the default CRAN repository used by the Libraries section of a classic cluster https://databricks.packagemanager.posit.co/cran/__linux__/noble/2025-03-20/? Could we assume we can use https://databricks.packagemanager.posit.co/cran/__linux__/jammy/2025-03-20/ instead?

Louis_Frolio
Databricks Employee

Hi @NW1000 ,

Glad you tried my suggestion, and thanks for sharing the details.

1. Why the init script failed

This message:

Init script failure: Cluster scoped init script ... failed: Script exit status is non-zero

really just means that something inside the bash script returned a non-zero exit code during cluster startup. In other words, the script hit an error and stopped.

The real clue will be in the init script log.

Here is where I would look:

  • Open the cluster details

  • Go to the Event Log or driver logs

  • Find the init script log file

For the script we were discussing, it should be something like:

/tmp/init-r-libs.log

Once you open that log, scroll to the bottom and look for the first real error message. That is usually where the root cause shows up.

In most cases, it tends to be one of these:

  • a typo in a path, such as the Volume path or script path

  • missing execute permissions on the script, for example:

    chmod +x init-script-RLib.sh

  • an R command inside the script failing, such as install.packages() returning an error, which will cause the whole script to exit non-zero

Once you have the last few lines from that log, it should be much easier to pinpoint exactly what failed and tighten up the script accordingly.
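As a side note, the generic "exit status is non-zero" behavior is easy to reproduce. In this hypothetical demo, a failing `cp` with a deliberately bogus source path stands in for whichever step breaks in your real script, and an ERR trap records the failing line in the log, which is exactly the kind of clue to look for at the bottom of /tmp/init-r-libs.log:

```shell
# Demo: why an init script exits non-zero, and how an ERR trap pinpoints the
# failing line in the log. The source path below is intentionally bogus.
cat > /tmp/demo-init.sh <<'EOF'
#!/bin/bash
set -euo pipefail
trap 'echo "init script failed at line $LINENO"' ERR

echo "step 1: ok"
cp -R /Volumes/does/not/exist/* /usr/local/lib/R/site-library/   # fails here
echo "step 2: never reached"
EOF

# Run it, capture the non-zero exit status, and show the last log line.
status=0
bash /tmp/demo-init.sh > /tmp/demo-init.log 2>&1 || status=$?
echo "exit status: $status"
tail -n 1 /tmp/demo-init.log
```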

2. About the default CRAN / Posit Package Manager URL

Yes — the URL you are seeing in the Libraries UI, something like:

https://databricks.packagemanager.posit.co/cran/__linux__/noble/2025-03-20/

is the Databricks-managed Posit Package Manager snapshot used by Databricks runtimes for R packages.

A few important things to know here:

  • Databricks pins R libraries to a specific CRAN snapshot, in this case 2025-03-20, so installs remain reproducible and stable

  • The __linux__/<codename>/2025-03-20 portion reflects the underlying Ubuntu release, such as jammy or noble

  • Databricks determines that automatically from the runtime OS for newer runtimes, including 17.x and above

  • That URL is intended to be used as the repos= value in install.packages(), not really as a browser-friendly page

  • So if you paste it into a browser and get something like “Invalid request,” that is not necessarily a problem — that can be expected behavior

 

If you want your own scripts to follow the same pattern across runtimes, the safest approach is to detect the OS codename dynamically and construct the URL from there, like this:

release <- system("lsb_release -c --short", intern = TRUE)
snapshot_date <- "2025-03-20"

options(
  HTTPUserAgent = sprintf(
    "R/%s R (%s)",
    getRversion(),
    paste(getRversion(), R.version["platform"], R.version["arch"], R.version["os"])
  ),
  repos = paste0(
    "https://databricks.packagemanager.posit.co/cran/__linux__/",
    release, "/", snapshot_date
  )
)

That way:

  • if the runtime is on jammy, it uses .../__linux__/jammy/2025-03-20/

  • if it is on noble, it uses .../__linux__/noble/2025-03-20/

That mirrors how Databricks handles the default CRAN configuration internally.
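If you prefer to do the same thing from an init script rather than from R, the codename detection can be sketched in shell as well. The fallbacks here are assumptions for illustration: reading /etc/os-release when lsb_release is not installed, and a purely illustrative "jammy" default if neither source yields a codename:

```shell
# Detect the Ubuntu codename and build the pinned snapshot URL, mirroring
# the R snippet above.
if command -v lsb_release >/dev/null 2>&1; then
  codename=$(lsb_release -cs)
else
  # Fallback: VERSION_CODENAME from /etc/os-release, if present
  codename=$( (. /etc/os-release && echo "$VERSION_CODENAME") 2>/dev/null || true)
fi
codename=${codename:-jammy}   # illustrative default if detection fails

snapshot="2025-03-20"
repo="https://databricks.packagemanager.posit.co/cran/__linux__/${codename}/${snapshot}"
echo "$repo"
```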

Hope this helps, Louis.


NW1000
New Contributor III

Hi Louis,

Thanks a lot for the great advice! 

I used the 17.3 LTS ML Runtime for this classic cluster. With the code you gave, it showed "noble". Does that mean I should use 'noble' in the library installation?