topic Re: Shorten Classic Cluster start up time in Data Engineering

Shorten Classic Cluster start up time

NW1000 — Mon, 09 Mar 2026 20:46:06 GMT

We use R notebooks to build our workflows, so we have to use classic clusters. We need roughly 10 additional R packages plus 2 PyPI packages. It takes at least 10-20 minutes to start the cluster, and we found that most of that time is spent on package installation. I tried to pre-install the packages to a volume:

# Run this ONCE on a running cluster, then save the library path
lib_path <- "/Volumes/datalake/test/rlib_cache"
#dir.create(lib_path, recursive = TRUE, showWarnings = FALSE)

packages <- c("mmrm", "emmeans", "striprtf", "pandoc",
              "glmmTMB", "kableExtra", "rtables",
              "tinytex", "tern")

install.packages(packages,
                 lib = lib_path,
                 repos = c(CRAN = "https://packagemanager.posit.co/cran/__linux__/noble/2025-03-20"),
                 HTTPUserAgent = sprintf("R/%s R (%s)", getRversion(),
                                         paste(getRversion(), R.version["platform"], R.version["arch"], R.version["os"])),
                 Ncpus = parallel::detectCores())

Then I set up the following .sh file as the init script for the classic cluster:

#!/bin/bash
set -uo pipefail
exec > /tmp/init-r-libs.log 2>&1

echo "=== R Library Init started at $(date -u) ==="

CUSTOM_R_LIBS="/Volumes/datalake/test/rlib_cache"

# Use Rprofile.site: this runs AFTER Databricks sets up its R environment
# so the custom path will persist
cat <<EOF | sudo tee -a /usr/lib/R/etc/Rprofile.site
# --- Custom R Library Path (added by init script) ---
local({
  custom_lib <- "${CUSTOM_R_LIBS}"
  if (dir.exists(custom_lib)) {
    .libPaths(c(custom_lib, .libPaths()))
  }
})
EOF

echo "Custom R library path added to Rprofile.site: $CUSTOM_R_LIBS"
echo "=== R Library Init completed at $(date -u) ==="

But the cluster came up without the R packages available, so this approach did not work.

Is there any way to shorten the cluster start up time? Thank you.

Re: Shorten Classic Cluster start up time

Kirankumarbs — Mon, 09 Mar 2026 21:05:11 GMT

The reason your Volume-based cache isn't working is a credential scoping issue. Databricks only injects UC Volume credentials into init scripts that are themselves stored on a UC Volume. If your init script lives in workspace files or cloud storage, it can't actually read from /Volumes/datalake/test/rlib_cache at execution time — even though the path looks fine and your R code works in a running notebook.

The fix: move your init script to the same Volume (e.g., /Volumes/datalake/test/scripts/init_r.sh). But I'd also change the strategy slightly. Instead of pointing .libPaths() at the Volume path, copy the packages to a local directory during init. Reading libraries over the FUSE mount adds noticeable latency on every library() call.

#!/bin/bash
cp -R /Volumes/datalake/test/rlib_cache/* /usr/local/lib/R/site-library/ 2>/dev/null

That copy takes maybe 10-30 seconds for your package set. Way better than 20 minutes of compilation.
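A slightly more defensive version of that copy step might look like the sketch below. It reuses the cache path from the original post and the destination directory from the one-liner above; the function name copy_r_cache and the variables R_CACHE_SRC and R_LIB_DEST are hypothetical names chosen for illustration. Instead of discarding errors, it logs what happened, and it returns success when the cache is missing so that the cluster can still start.

```shell
#!/bin/bash
# Sketch only: copy pre-built R packages from a Volume cache into a local
# library directory. The default paths are assumptions taken from this thread.

# copy_r_cache SRC DEST
# Copies everything under SRC into DEST.
# Returns 0 (and logs a warning) if SRC is missing or empty, so that a
# missing cache does not block cluster startup.
copy_r_cache() {
  local src="$1" dest="$2"

  if [ ! -d "$src" ] || [ -z "$(ls -A "$src" 2>/dev/null)" ]; then
    echo "WARN: cache $src is missing or empty; nothing copied"
    return 0
  fi

  mkdir -p "$dest" || return 1
  # "$src"/. copies the directory contents rather than the directory itself
  cp -R "$src"/. "$dest"/ || return 1
  echo "Copied $(ls -A "$src" | wc -l | tr -d ' ') entries from $src to $dest"
}

# Hypothetical variable names; override them to point at other paths.
copy_r_cache "${R_CACHE_SRC:-/Volumes/datalake/test/rlib_cache}" \
             "${R_LIB_DEST:-/usr/local/lib/R/site-library}"
```

Returning 0 on a missing cache is a design choice: the cluster starts without the cached packages rather than failing outright. Change the first `return 0` to `return 1` if a missing cache should stop the startup instead.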

Hope this helps! If it helps, mark it as a solution!

Re: Shorten Classic Cluster start up time

Louis_Frolio — Wed, 11 Mar 2026 11:09:33 GMT

Hey @NW1000 — good question, and your instinct to pre-compile and cache is the right one. There are three separate things working against you here, and fixing all three should collapse that 10-20 minute startup significantly.

You're pulling source builds, not binaries.

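Your PPM URL points at __linux__/noble/..., but not every Databricks Runtime runs on Ubuntu 24.04 (noble): 14.x and 15.x run on Ubuntu 22.04 (jammy), so confirm the codename on your runtime with lsb_release -c --short before choosing a URL. When PPM can't match the distro, it silently falls back to compiling from source, and that's almost certainly where most of your time is going. If your runtime reports jammy, switch to: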

repos <- c(CRAN = "https://packagemanager.posit.co/cran/__linux__/jammy/latest")

Precompiled binaries should cut install time from minutes to seconds per package.

Direct install to a Volume path isn't reliable.

install.packages(..., lib = "/Volumes/...") doesn't behave consistently across driver and workers. The documented pattern is two steps — install to the default library location first, then copy to the Volume:

# Run once on a "build" cluster with the same DBR you'll use in production
pkgs <- c("mmrm","emmeans","striprtf","pandoc",
          "glmmTMB","kableExtra","rtables",
          "tinytex","tern")
repos <- c(CRAN = "https://packagemanager.posit.co/cran/__linux__/jammy/latest")
install.packages(pkgs, repos = repos)

# Then copy to Volume
volume_pkgs <- "/Volumes/<catalog>/<schema>/<volume>/r_libs"
dir.create(volume_pkgs, recursive = TRUE, showWarnings = FALSE)
sapply(pkgs, function(p) {
  file.copy(from = find.package(p),
            to   = volume_pkgs,
            recursive = TRUE)
})

Rprofile.site is the wrong config file.

Databricks controls R library paths through environment variables (R_LIBS_USER, etc.) set in /etc/R/Renviron.site. Changes in Rprofile.site can get overridden by the Databricks startup chain, which is likely why your packages didn't show up at runtime. Your init script should modify Renviron.site instead:

#!/bin/bash
set -euxo pipefail

VOLUME_PKGS="/Volumes/<catalog>/<schema>/<volume>/r_libs"

cat <<EOF >> /etc/R/Renviron.site
R_LIBS_USER=%U:/databricks/spark/R/lib:/local_disk0/.ephemeral_nfs/cluster_libraries/r:$VOLUME_PKGS
EOF

This ensures .libPaths() picks up your cached packages automatically when R starts.

Quick win for the PyPI packages: use the cluster Libraries tab directly (Compute → your cluster → Libraries → Install New → PyPI). Those get stored in the cluster-libraries location and don't need init scripts or notebooks.

Once you've made these changes, verify on a fresh R notebook:

.libPaths()
installed.packages()[, "Package"]

Confirm the Volume path shows up and your packages are found there. Also check the init script log on the driver to make sure it ran cleanly.
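A similar check can be run from the shell, for example at the end of an init script, by confirming that each expected package directory contains a DESCRIPTION file (every installed R package has one). The sketch below is a hypothetical helper, not a Databricks feature: the function name missing_r_pkgs is made up, and the library path is whatever directory you pass in.

```shell
#!/bin/bash
# missing_r_pkgs LIB_DIR PKG...
# Prints the name of every package that has no LIB_DIR/PKG/DESCRIPTION file.
# Returns 0 if all packages are present, 1 if at least one is missing.
missing_r_pkgs() {
  local lib_dir="$1"; shift
  local pkg status=0
  for pkg in "$@"; do
    if [ ! -f "$lib_dir/$pkg/DESCRIPTION" ]; then
      echo "$pkg"
      status=1
    fi
  done
  return "$status"
}

# Example (placeholder path as used earlier in this thread):
#   missing_r_pkgs "/Volumes/<catalog>/<schema>/<volume>/r_libs" mmrm emmeans tern
```

Passing dependency names as well (for example TMB for glmmTMB) checks that they were cached too, since the copy loop above only copies the packages listed in pkgs.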

Worth flagging — the distro fix alone (jammy vs. noble) might solve 80% of this even without the caching layer. The caching on top of that should get your cold starts well under 2 minutes.

Hope that helps. Let us know how it goes.

Cheers, Louis.

Re: Shorten Classic Cluster start up time

NW1000 — Thu, 12 Mar 2026 01:47:51 GMT

Thanks a lot, Louis! I tried your method and saved the init script in another volume, but the cluster failed to start with this error: Init script failure: Cluster scoped init script /Volumes/datalake/test/utility_rlib_init_script/init-script-RLib.sh failed: Script exit status is non-zero. Why did it fail?

I have another question: is https://databricks.packagemanager.posit.co/cran/__linux__/noble/2025-03-20/ the default CRAN repository used when installing packages from the Libraries section of a classic cluster? Can we assume that we can use https://databricks.packagemanager.posit.co/cran/__linux__/jammy/2025-03-20/?

Re: Shorten Classic Cluster start up time

Louis_Frolio — Thu, 12 Mar 2026 10:34:26 GMT

Hi @NW1000 ,

Glad you tried my suggestion, and thanks for sharing the details.

1. Why the init script failed

This message:

Init script failure: Cluster scoped init script ... failed: Script exit status is non-zero

really just means that something inside the bash script returned a non-zero exit code during cluster startup. In other words, the script hit an error and stopped.

The real clue will be in the init script log.

Here is where I would look:

Open the cluster details
Go to the Event Log or driver logs
Find the init script log file

For the script we were discussing, it should be something like:

/tmp/init-r-libs.log

Once you open that log, scroll to the bottom and look for the first real error message. That is usually where the root cause shows up.

In most cases, it tends to be one of these:

a typo in a path, such as the Volume path or script path
missing execute permissions on the script, for example:

chmod +x init-script-RLib.sh
an R command inside the script failing, such as install.packages() returning an error, which will cause the whole script to exit non-zero

Once you have the last few lines from that log, it should be much easier to pinpoint exactly what failed and tighten up the script accordingly.
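To make that log easier to read, the init script can record the exit status of each step. The sketch below shows one possible pattern, assuming a bash init script: a small wrapper that appends a command's output to a log file together with start and end markers. The function name run_logged is hypothetical, and the log path in the usage comment is the one from the original script.

```shell
#!/bin/bash
# run_logged LOG_FILE COMMAND [ARGS...]
# Runs COMMAND, appending its stdout and stderr to LOG_FILE between
# timestamped START/END markers. Returns COMMAND's exit status.
run_logged() {
  local log_file="$1"; shift
  local status=0
  echo "=== $(date -u) START: $*" >> "$log_file"
  "$@" >> "$log_file" 2>&1 || status=$?
  echo "=== $(date -u) END (exit $status): $*" >> "$log_file"
  return "$status"
}

# Example use inside an init script (paths taken from this thread):
#   LOG=/tmp/init-r-libs.log
#   run_logged "$LOG" cp -R /Volumes/datalake/test/rlib_cache/. /usr/local/lib/R/site-library/ || exit 1
```

With this pattern, the last END line that shows a non-zero exit status identifies the step that made the script fail, and the lines just above it hold that step's error output.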

2. About the default CRAN / Posit Package Manager URL

Yes — the URL you are seeing in the Libraries UI, something like:

https://databricks.packagemanager.posit.co/cran/__linux__/noble/2025-03-20/

is the Databricks-managed Posit Package Manager snapshot used by Databricks runtimes for R packages.

A few important things to know here:

Databricks pins R libraries to a specific CRAN snapshot, in this case 2025-03-20, so installs remain reproducible and stable
The __linux__/<codename>/2025-03-20 portion reflects the underlying Ubuntu release, such as jammy or noble
Databricks determines that automatically from the runtime OS for newer runtimes, including 17.x and above
That URL is intended to be used as the repos= value in install.packages(), not really as a browser-friendly page
So if you paste it into a browser and get something like “Invalid request,” that is not necessarily a problem — that can be expected behavior

If you want your own scripts to follow the same pattern across runtimes, the safest approach is to detect the OS codename dynamically and construct the URL from there, like this:

release <- system("lsb_release -c --short", intern = TRUE)
snapshot_date <- "2025-03-20"

options(
  HTTPUserAgent = sprintf(
    "R/%s R (%s)",
    getRversion(),
    paste(getRversion(), R.version["platform"], R.version["arch"], R.version["os"])
  ),
  repos = paste0(
    "https://databricks.packagemanager.posit.co/cran/__linux__/",
    release, "/", snapshot_date
  )
)

That way:

if the runtime is on jammy, it uses .../__linux__/jammy/2025-03-20/
if it is on noble, it uses .../__linux__/noble/2025-03-20/

That mirrors how Databricks handles the default CRAN configuration internally.
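The same idea can be applied in a bash init script. The sketch below is a minimal illustration under a few assumptions: the URL pattern is the one quoted in this thread, the function names ppm_repo_url and detect_codename are hypothetical, and the codename is read from the VERSION_CODENAME field that Ubuntu provides in /etc/os-release. It does not check that the resulting codename and snapshot combination actually exists on the server.

```shell
#!/bin/bash
# ppm_repo_url CODENAME SNAPSHOT_DATE
# Prints the repository URL for an Ubuntu codename and snapshot date,
# following the URL pattern quoted in this thread.
ppm_repo_url() {
  local codename="$1" snapshot="$2"
  printf 'https://databricks.packagemanager.posit.co/cran/__linux__/%s/%s\n' \
    "$codename" "$snapshot"
}

# detect_codename [OS_RELEASE_FILE]
# Prints VERSION_CODENAME from an os-release file (default: /etc/os-release).
detect_codename() {
  local file="${1:-/etc/os-release}"
  # Source the file in a subshell so its variables do not leak into the script.
  ( . "$file" && printf '%s\n' "${VERSION_CODENAME:-}" )
}

# Example:
#   ppm_repo_url "$(detect_codename)" 2025-03-20
```

On a runtime that reports noble, the example call prints the .../__linux__/noble/2025-03-20 URL, which can then be passed to R as the repos value.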

Hope this helps, Louis.


Re: Shorten Classic Cluster start up time

NW1000 — Thu, 12 Mar 2026 14:54:30 GMT

Hi Louis,

Thanks a lot for the great advice!

I used the 17.3 LTS ML runtime for this classic cluster. With the code you gave, it showed "noble". Does that mean I should use 'noble' for the library installation?

Re: Shorten Classic Cluster start up time

RyanTImpe — Tue, 07 Apr 2026 19:57:34 GMT

Thanks for this detail so far. New to this thread but facing similar problem as @NW1000 .

I have a few big R dependencies (a messy combo of Matrix, lme4, rstanarm). It used to take 10 minutes to install them from jammy after starting a cluster, but it now takes 30 (I haven't sorted out why yet).

I followed your recs, copied the packages to Volumes, and then made the new init file for Renviron.site. But the cluster fails to start, as it does for NW1000. Gemini is suggesting it's because the cluster doesn't have access to Volumes at this point in the boot, but I'm still struggling to get it all fixed. I'm also struggling to find the init script log in the Driver Logs.

Are there any other potential fixes I can explore?