Louis_Frolio
Databricks Employee

Hey @NW1000 — good question, and your instinct to pre-compile and cache is the right one. There are three separate things working against you here, and fixing all three should collapse that 10-20 minute startup significantly.

  1. You're pulling source builds, not binaries.

Your PPM URL points at __linux__/noble/..., but Databricks Runtimes (14.x, 15.x, 17.x) run on Ubuntu 22.04 (jammy). When PPM can't match the distro, it silently falls back to compiling from source — that's almost certainly where most of your time is going. Switch to:

repos <- c(CRAN = "https://packagemanager.posit.co/cran/__linux__/jammy/latest")

Precompiled binaries should cut install time from minutes to seconds per package.
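To confirm the change took effect for the session, a quick check like this works. The HTTPUserAgent line is a workaround from Posit's Package Manager docs for environments that strip the R version from the download user agent (which makes PPM fall back to source tarballs); it's likely unnecessary on DBR, but harmless:

```r
# Persist the binary-serving repo for the session and sanity-check it.
repos <- c(CRAN = "https://packagemanager.posit.co/cran/__linux__/jammy/latest")
options(repos = repos)

# Restore the R-version user agent in case this environment strips it
# (per Posit's PPM docs); without it PPM may serve source packages.
options(HTTPUserAgent = sprintf(
  "R/%s R (%s)", getRversion(),
  paste(getRversion(), R.version["platform"],
        R.version["arch"], R.version["os"])
))

getOption("repos")  # should show the jammy URL
```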

  2. Direct install to a Volume path isn't reliable.

install.packages(..., lib = "/Volumes/...") doesn't behave consistently across driver and workers. The documented pattern is two steps — install to the default library location first, then copy to the Volume:

# Run once on a "build" cluster with the same DBR you'll use in production
pkgs <- c("mmrm","emmeans","striprtf","pandoc",
          "glmmTMB","kableExtra","rtables",
          "tinytex","tern")
repos <- c(CRAN = "https://packagemanager.posit.co/cran/__linux__/jammy/latest")
install.packages(pkgs, repos = repos)

# Then copy to Volume
volume_pkgs <- "/Volumes/<catalog>/<schema>/<volume>/r_libs"
dir.create(volume_pkgs, recursive = TRUE, showWarnings = FALSE)
sapply(pkgs, function(p) {
  file.copy(from = find.package(p),
            to   = volume_pkgs,
            recursive = TRUE)
})
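One caveat with the copy step above: find.package() resolves only the packages you name, not their dependencies, so library() calls on the production cluster can fail with missing-namespace errors. A sketch that also copies the recursive dependencies, using base tools against the installed-package database (package names and the Volume path are the same placeholders as above):

```r
# Copy the named packages plus their recursive strong dependencies.
pkgs <- c("mmrm","emmeans","striprtf","pandoc",
          "glmmTMB","kableExtra","rtables",
          "tinytex","tern")
volume_pkgs <- "/Volumes/<catalog>/<schema>/<volume>/r_libs"
dir.create(volume_pkgs, recursive = TRUE, showWarnings = FALSE)

db   <- installed.packages()
deps <- unlist(tools::package_dependencies(pkgs, db = db, recursive = TRUE))

# Drop base/recommended packages that ship with R; keep only add-on
# packages actually installed in this library.
addons   <- rownames(db)[is.na(db[, "Priority"])]
all_pkgs <- intersect(union(pkgs, unique(deps)), addons)

for (p in all_pkgs) {
  file.copy(from = find.package(p), to = volume_pkgs, recursive = TRUE)
}
```

This replaces, rather than supplements, the sapply() loop above if you use it.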

  3. Rprofile.site is the wrong config file.

Databricks controls R library paths through environment variables (R_LIBS_USER, etc.) set in /etc/R/Renviron.site. Changes in Rprofile.site can get overridden by the Databricks startup chain, which is likely why your packages didn't show up at runtime. Your init script should modify Renviron.site instead:

#!/bin/bash
set -euxo pipefail

VOLUME_PKGS="/Volumes/<catalog>/<schema>/<volume>/r_libs"

cat <<EOF >> /etc/R/Renviron.site
R_LIBS_USER=%U:/databricks/spark/R/lib:/local_disk0/.ephemeral_nfs/cluster_libraries/r:$VOLUME_PKGS
EOF

This ensures .libPaths() picks up your cached packages automatically when R starts.

Quick win for the PyPI packages: use the cluster Libraries tab directly (Compute → your cluster → Libraries → Install New → PyPI). Those get stored in the cluster-libraries location and don't need init scripts or notebooks.

Once you've made these changes, verify on a fresh R notebook:

.libPaths()
installed.packages()[, "Package"]

Confirm the Volume path shows up and your packages are found there. Also check the init script log on the driver to make sure it ran cleanly.
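To make that check concrete, something like this (reusing the placeholder Volume path) should report TRUE on a correctly configured cluster:

```r
# Sanity check: the Volume library is on the search path, and a cached
# package resolves from the Volume rather than a fresh install location.
volume_pkgs <- "/Volumes/<catalog>/<schema>/<volume>/r_libs"
volume_pkgs %in% .libPaths()                    # expect TRUE
startsWith(find.package("mmrm"), volume_pkgs)   # expect TRUE
```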

Worth flagging — the distro fix alone (jammy vs. noble) might solve 80% of this even without the caching layer. The caching on top of that should get your cold starts well under 2 minutes.

Hope that helps. Let us know how it goes.

Cheers, Louis.