
R Package Installation Best Practices

araiho
New Contributor II

Hello,

We are new to Databricks and are wondering what the best practices are for R package installation. Our clusters currently take more than 20 minutes to spin up with our init scripts. We have tried the following:

1. Libraries tab in the cluster preferences
2. Docker container
3. Init script shown below 

Any help would be appreciated. We haven't been able to start development because of these wait times. 

Ann

 

#!/bin/bash

# Update package list and install system dependencies
apt-get update -qq && apt-get install -y -qq \
gdal-bin \
libgdal-dev \
libudunits2-dev \
libproj-dev \
libgeos-dev \
r-cran-covr \
r-cran-inline \
r-cran-pkgkitten \
r-cran-tinytest \
r-cran-xml2 \
r-cran-zoo

# Install R packages
R -e "install.packages('prism', repos='https://cloud.r-project.org/')"

# Install Python packages
/databricks/python3/bin/pip install cutadapt

# Install GDAL with pip
/databricks/python3/bin/pip install GDAL==3.2.2.1

# Print completion message
echo "Initialization script completed successfully."
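
One thing that may be worth checking in the script above: the GDAL Python bindings are generally built against the system GDAL library, so the version pinned with pip usually needs to match what libgdal-dev provides. A minimal sketch of that check follows. The helper name `major_minor_patch` is hypothetical, and treating the fourth component of 3.2.2.1 as a packaging revision is an assumption.

```shell
#!/bin/bash
# Compare the GDAL version pinned for pip with the one reported by the system library.
# gdal-config ships with libgdal-dev; the pinned value is copied from the init script above.
pinned="3.2.2.1"

# Keep only major.minor.patch so "3.2.2.1" and "3.2.2" compare as equal.
# (Hypothetical helper; the fourth component is assumed to be a packaging revision.)
major_minor_patch() {
  echo "$1" | cut -d. -f1-3
}

if command -v gdal-config >/dev/null 2>&1; then
  system="$(gdal-config --version)"
  if [ "$(major_minor_patch "$pinned")" = "$(major_minor_patch "$system")" ]; then
    echo "pip pin $pinned matches system GDAL $system"
  else
    echo "pip pin $pinned does not match system GDAL $system"
  fi
else
  echo "gdal-config not found; is libgdal-dev installed?"
fi
```

If the two versions differ, the pip step may spend time compiling and then fail, which would add to the startup delay.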
2 REPLIES

araiho
New Contributor II

I wanted to add that with this script I cannot load the prism or sf packages. I think there is something going on with the directories that GDAL and PROJ are installed to.

araiho
New Contributor II

@Retired_mod Thank you for your detailed response! I think we would like to use Docker if we can, because we are not using RStudio but R directly in Databricks notebooks and workflows. So any more information about R, Docker, and Databricks would also be useful. Currently, this Docker code builds successfully and is archived successfully, but is not deploying on Databricks.

 

# syntax=docker/dockerfile:1.2

# Stage 1: Build R environment with Rocker
FROM --platform=linux/amd64 rocker/r-base:latest AS rbuilder

# Install required R packages in the Rocker image
RUN apt-get update && apt-get install -y \
    r-cran-dplyr \
    r-cran-ggplot2 \
    r-cran-tidyr \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Stage 2: Use Databricks image and copy R installation from Rocker
FROM --platform=linux/amd64 databricksruntime/standard:latest

# Copy R binaries and libraries from the Rocker image
COPY --from=rbuilder /usr/lib/R /usr/lib/R
COPY --from=rbuilder /usr/share/R /usr/share/R
COPY --from=rbuilder /etc/R /etc/R
COPY --from=rbuilder /usr/bin/R /usr/bin/R
COPY --from=rbuilder /usr/bin/Rscript /usr/bin/Rscript

# Ensure the R library paths are correctly set
ENV R_HOME=/usr/lib/R
ENV PATH=$PATH:/usr/lib/R/bin

# Copy R packages from the previous stage
COPY --from=rbuilder /usr/lib/R/site-library /usr/local/lib/R/site-library
COPY --from=rbuilder /usr/lib/x86_64-linux-gnu/ /usr/lib/x86_64-linux-gnu/

 



I have solved my dependency problem with the following code in my notebook, but I am a bit confused about why it works. PROJ_LIB has to be set to /usr/share/proj, and then the PROJ library path has to be set again, to /lib/x86_64-linux-gnu, in the install of sf and prism. The repo for sf has to be https://cran.r-project.org, but it could be https://packagemanager.rstudio.com/cran/__linux__/focal/latest for prism. I would like to use the second repo as much as possible to install R packages because it is much faster than CRAN.

 

%r
system('sudo apt-get -y update && sudo apt-get install -y libudunits2-dev libgdal-dev libgeos-dev libproj-dev')

%sh
ldconfig -p | grep gdal
ldconfig -p | grep geos
ldconfig -p | grep proj

%r
options(HTTPUserAgent = sprintf(
  "R/%s R (%s)", 
  getRversion(), 
  paste(
    getRversion(), 
    R.version["platform"], 
    R.version["arch"], 
    R.version["os"]
  )
))

Sys.setenv(PROJ_LIB = "/usr/share/proj")

install.packages('units', lib='/databricks/spark/R/lib/',
  repos="https://cran.r-project.org")
install.packages('sf', 
  configure.args = "--with-proj-lib=/lib/x86_64-linux-gnu --with-proj-include=/usr/include",
  lib='/databricks/spark/R/lib/',
  repos="https://cran.r-project.org"
)

library(sf, lib.loc='/databricks/spark/R/lib/')
install.packages('prism', 
  configure.args = "--with-proj-lib=/lib/x86_64-linux-gnu --with-proj-include=/usr/include",
  lib='/databricks/spark/R/lib/',
  repos = c(CRAN = "https://packagemanager.rstudio.com/cran/__linux__/focal/latest")
)
library(prism, lib.loc='/databricks/spark/R/lib/')
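
A possible explanation for the two different paths, offered as my understanding rather than a confirmed diagnosis: PROJ_LIB tells PROJ where to find its data files (such as proj.db) at run time, while --with-proj-lib tells the sf configure script where the libproj shared library lives for linking. The sketch below checks both directories. The function name `check_proj_dirs` is hypothetical, and the two paths in the final call are the ones from the notebook code above.

```shell
#!/bin/bash
# Report whether a PROJ data directory and a PROJ library directory look usable.
# $1 = data directory (what PROJ_LIB points at)
# $2 = library directory (what --with-proj-lib points at)
check_proj_dirs() {
  local data_dir="$1" lib_dir="$2"

  # PROJ looks up proj.db in its data directory at run time.
  if [ -f "$data_dir/proj.db" ]; then
    echo "data: proj.db found in $data_dir"
  else
    echo "data: proj.db missing from $data_dir"
  fi

  # The library directory is where the libproj shared object is expected.
  if ls "$lib_dir"/libproj.so* >/dev/null 2>&1; then
    echo "lib: libproj found in $lib_dir"
  else
    echo "lib: libproj missing from $lib_dir"
  fi
}

# Paths taken from the notebook code above.
check_proj_dirs /usr/share/proj /lib/x86_64-linux-gnu
```

Run in a `%sh` cell, this shows whether each path holds what it is being used for, which would explain why both settings are needed.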

 
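
If the notebook steps above need to run at cluster start instead, they could be folded back into an init script. This is only a sketch under assumptions: the function name `write_init_script` and the default output path are hypothetical, and the R commands simply repeat the ones that worked in the notebook. Whether it shortens startup depends on how much still has to compile from source.

```shell
#!/bin/bash
# Write an init script that repeats the working notebook steps at cluster start.
# The target path is a placeholder; use wherever your workspace keeps init scripts.
write_init_script() {
  local target="$1"
  cat > "$target" <<'EOF'
#!/bin/bash
# System libraries needed to build sf (same list as the notebook cell).
apt-get update -qq
apt-get install -y -qq libudunits2-dev libgdal-dev libgeos-dev libproj-dev

# Same R steps as the notebook, installing into the same library path.
Rscript -e '
  options(HTTPUserAgent = sprintf("R/%s R (%s)", getRversion(),
    paste(getRversion(), R.version["platform"], R.version["arch"], R.version["os"])))
  Sys.setenv(PROJ_LIB = "/usr/share/proj")
  lib <- "/databricks/spark/R/lib/"
  proj_args <- "--with-proj-lib=/lib/x86_64-linux-gnu --with-proj-include=/usr/include"
  install.packages("units", lib = lib, repos = "https://cran.r-project.org")
  install.packages("sf", lib = lib, configure.args = proj_args,
    repos = "https://cran.r-project.org")
  install.packages("prism", lib = lib, configure.args = proj_args,
    repos = c(CRAN = "https://packagemanager.rstudio.com/cran/__linux__/focal/latest"))
'
EOF
  chmod +x "$target"
}

# Default path is a placeholder for illustration only.
write_init_script "${1:-/tmp/install-r-geo.sh}"
```

The heredoc is quoted (`'EOF'`) so nothing inside it is expanded when the file is written; the generated script can be inspected with `bash -n` before attaching it to a cluster.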

Anyway! Thank you again for answering.
