2 weeks ago
Hey @NW1000 — good question, and your instinct to pre-compile and cache is the right one. There are three separate things working against you here, and fixing all three should collapse that 10-20 minute startup significantly.
- You're pulling source builds, not binaries.
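Your PPM URL points at __linux__/noble/..., and the codename in that URL has to match the Ubuntu release of your cluster image. Databricks Runtime 14.x and 15.x run on Ubuntu 22.04 (jammy); newer runtimes may ship a later Ubuntu release, so confirm against the release notes for your DBR version or read VERSION_CODENAME in /etc/os-release on the cluster. When the binaries PPM serves don't match the distro, you typically end up compiling from source, and that's almost certainly where most of your time is going. For a jammy-based runtime, switch to: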
repos <- c(CRAN = "https://packagemanager.posit.co/cran/__linux__/jammy/latest")
Precompiled binaries should cut install time from minutes to seconds per package.
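If you'd rather not hard-code the codename, here is a minimal sketch that reads it from the node and builds the matching PPM URL. The helper names detect_codename and ppm_repo_url are made up for illustration, and it assumes the image has a standard /etc/os-release with a VERSION_CODENAME line:

```shell
#!/bin/bash
# Sketch only: detect_codename and ppm_repo_url are hypothetical helper names.

# Print VERSION_CODENAME from an os-release style file (default: /etc/os-release).
detect_codename() {
  local os_release="${1:-/etc/os-release}"
  # Lines look like VERSION_CODENAME=jammy; strip optional quotes, keep the first match.
  sed -n 's/^VERSION_CODENAME=//p' "$os_release" | tr -d '"' | head -n 1
}

# Build the Posit Package Manager CRAN URL for a given Ubuntu codename.
ppm_repo_url() {
  local codename="$1"
  printf 'https://packagemanager.posit.co/cran/__linux__/%s/latest\n' "$codename"
}

# Usage: prints the URL to use in repos <- c(CRAN = "...")
#   ppm_repo_url "$(detect_codename)"
```

Run the usage line from a %sh cell or a web terminal on the cluster and paste the result into your repos setting.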
- Direct install to a Volume path isn't reliable.
install.packages(..., lib = "/Volumes/...") doesn't behave consistently across driver and workers. The documented pattern is two steps — install to the default library location first, then copy to the Volume:
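# Run once on a "build" cluster with the same DBR you'll use in production
pkgs <- c("mmrm", "emmeans", "striprtf", "pandoc",
          "glmmTMB", "kableExtra", "rtables",
          "tinytex", "tern")
repos <- c(CRAN = "https://packagemanager.posit.co/cran/__linux__/jammy/latest")
install.packages(pkgs, repos = repos)
# Also collect recursive dependencies; the listed packages won't load
# from the Volume on a fresh cluster without them
deps <- tools::package_dependencies(pkgs, db = available.packages(repos = repos),
                                    recursive = TRUE)
base_pkgs <- rownames(installed.packages(priority = "base"))
to_copy <- setdiff(unique(c(pkgs, unlist(deps))), base_pkgs)
# Then copy to Volume
volume_pkgs <- "/Volumes/<catalog>/<schema>/<volume>/r_libs"
dir.create(volume_pkgs, recursive = TRUE, showWarnings = FALSE)
sapply(to_copy, function(p) {
  file.copy(from = find.package(p),
            to = volume_pkgs,
            recursive = TRUE)
})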
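After the copy, it's worth confirming that each package directory actually landed in the Volume. A rough sketch, assuming an installed R package always has a DESCRIPTION file at its top level; missing_r_pkgs is a hypothetical helper name, not anything Databricks ships:

```shell
#!/bin/bash
# Sketch only: missing_r_pkgs is a hypothetical helper.
# Usage: missing_r_pkgs <lib_dir> <pkg>...
# Prints one name per line for every package that has no DESCRIPTION file
# under <lib_dir>/<pkg>; prints nothing when all packages are present.
missing_r_pkgs() {
  local lib_dir="$1"
  shift
  local pkg
  for pkg in "$@"; do
    if [ ! -f "$lib_dir/$pkg/DESCRIPTION" ]; then
      printf '%s\n' "$pkg"
    fi
  done
}

# Example call (same placeholder path as above):
#   missing_r_pkgs "/Volumes/<catalog>/<schema>/<volume>/r_libs" mmrm emmeans tern
```

Empty output means every package you listed is in place. Anything printed needs to be copied again.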
- Rprofile.site is the wrong config file.
Databricks controls R library paths through environment variables (R_LIBS_USER, etc.) set in /etc/R/Renviron.site. Changes in Rprofile.site can get overridden by the Databricks startup chain, which is likely why your packages didn't show up at runtime. Your init script should modify Renviron.site instead:
#!/bin/bash
set -euxo pipefail
VOLUME_PKGS="/Volumes/<catalog>/<schema>/<volume>/r_libs"
cat <<EOF >> /etc/R/Renviron.site
R_LIBS_USER=%U:/databricks/spark/R/lib:/local_disk0/.ephemeral_nfs/cluster_libraries/r:$VOLUME_PKGS
EOF
This ensures .libPaths() picks up your cached packages automatically when R starts.
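One optional refinement, in case the script is ever re-run on the same node or another init script has already added the path: only append when the Volume path isn't listed yet. This is a sketch under that assumption. add_r_lib_path is a hypothetical helper, and the R_LIBS_USER value is the same one used in the init script above:

```shell
#!/bin/bash
# Sketch only: add_r_lib_path is a hypothetical helper.
# Usage: add_r_lib_path <renviron_file> <lib_path>
# Appends an R_LIBS_USER line unless <lib_path> already appears in the file.
add_r_lib_path() {
  local renviron="$1" lib_path="$2"
  # -F matches a fixed string, -q suppresses output; skip if the path is already present.
  if [ -f "$renviron" ] && grep -qF -- "$lib_path" "$renviron"; then
    return 0
  fi
  # %%U in the printf format emits a literal %U, which R expands to its default user library.
  printf 'R_LIBS_USER=%%U:/databricks/spark/R/lib:/local_disk0/.ephemeral_nfs/cluster_libraries/r:%s\n' \
    "$lib_path" >> "$renviron"
}

# Example call inside the init script:
#   add_r_lib_path /etc/R/Renviron.site "/Volumes/<catalog>/<schema>/<volume>/r_libs"
```

Calling it twice leaves a single R_LIBS_USER line in the file, so repeated runs don't pile up duplicates.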
Quick win for the PyPI packages: use the cluster Libraries tab directly (Compute → your cluster → Libraries → Install New → PyPI). Those get stored in the cluster-libraries location and don't need init scripts or notebooks.
Once you've made these changes, verify on a fresh R notebook:
.libPaths()
installed.packages()[, "Package"]
Confirm the Volume path shows up and your packages are found there. Also check the init script log on the driver to make sure it ran cleanly.
Worth flagging — the distro fix alone (jammy vs. noble) might solve 80% of this even without the caching layer. The caching on top of that should get your cold starts well under 2 minutes.
Hope that helps. Let us know how it goes.
Cheers, Louis.