Community Platform Discussions

R Package Installation Best Practices

araiho
New Contributor II

Hello,

We are new to Databricks and are wondering what the best practices are for R package installation. We currently have cluster spin-up wait times of more than 20 minutes with our init scripts. We have tried the following:

1. Libraries tab in the cluster preferences
2. Docker container
3. Init script shown below 

Any help would be appreciated. We haven't been able to start development because of these wait times. 

Ann

 

#!/bin/bash

# Update package list and install system dependencies
apt-get update -qq && apt-get install -y -qq \
  gdal-bin \
  libgdal-dev \
  libudunits2-dev \
  libproj-dev \
  libgeos-dev \
  r-cran-covr \
  r-cran-inline \
  r-cran-pkgkitten \
  r-cran-tinytest \
  r-cran-xml2 \
  r-cran-zoo

# Install R packages
R -e "install.packages('prism', repos='https://cloud.r-project.org/')"

# Install Python packages
/databricks/python3/bin/pip install cutadapt

# Install GDAL with pip
/databricks/python3/bin/pip install GDAL==3.2.2.1

# Print completion message
echo "Initialization script completed successfully."
2 REPLIES

araiho
New Contributor II

I wanted to add that with this script I cannot load the prism or sf packages. I think there is something going on with the directories that GDAL and PROJ are installed to.
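
A quick way to see which GDAL/PROJ installations the driver actually exposes (a hedged sketch, run from a %r cell; the paths shown are the usual Ubuntu locations, not confirmed for this cluster):

# Diagnostics: where do the loader, GDAL, and PROJ keep their files?
Sys.getenv("PROJ_LIB")                                  # PROJ data directory R will use (may be empty)
system("ldconfig -p | grep -E 'gdal|geos|proj'")        # shared libraries visible to the loader
system("gdal-config --version; gdal-config --datadir")  # GDAL version and its data directory
list.files("/usr/share/proj")                           # PROJ data files, if installed there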

araiho
New Contributor II

@Retired_mod Thank you for your detailed response! I think we would like to use Docker if we can, because we are not using RStudio but R directly in Databricks notebooks and workflows. So any more information about R, Docker, and Databricks would also be useful. Currently, this Docker build succeeds and the image is archived successfully, but it is not deploying on Databricks.

 

# syntax=docker/dockerfile:1.2

# Stage 1: Build R environment with Rocker
FROM --platform=linux/amd64 rocker/r-base:latest AS rbuilder

# Install required R packages in the Rocker image
RUN apt-get update && apt-get install -y \
    r-cran-dplyr \
    r-cran-ggplot2 \
    r-cran-tidyr \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Stage 2: Use Databricks image and copy R installation from Rocker
FROM --platform=linux/amd64 databricksruntime/standard:latest

# Copy R binaries and libraries from the Rocker image
COPY --from=rbuilder /usr/lib/R /usr/lib/R
COPY --from=rbuilder /usr/share/R /usr/share/R
COPY --from=rbuilder /etc/R /etc/R
COPY --from=rbuilder /usr/bin/R /usr/bin/R
COPY --from=rbuilder /usr/bin/Rscript /usr/bin/Rscript

# Ensure the R library paths are correctly set
ENV R_HOME=/usr/lib/R
ENV PATH=$PATH:/usr/lib/R/bin

# Copy R packages from the previous stage
COPY --from=rbuilder /usr/lib/R/site-library /usr/local/lib/R/site-library
COPY --from=rbuilder /usr/lib/x86_64-linux-gnu/ /usr/lib/x86_64-linux-gnu/

 



I have solved my dependency problem with the following code in my notebook, but I am a bit confused about why it works: PROJ_LIB has to be set to /usr/share/proj and then reset to /lib/x86_64-linux-gnu (via configure.args) when installing sf and prism, and the repo for sf has to be https://cran.r-project.org, while for prism it can be https://packagemanager.rstudio.com/cran/__linux__/focal/latest. I would like to use the second repo as much as possible to install R packages because it is much faster than CRAN.

 

%r
system('sudo apt-get -y update && sudo apt-get install -y libudunits2-dev libgdal-dev libgeos-dev libproj-dev')

%sh
ldconfig -p | grep gdal
ldconfig -p | grep geos
ldconfig -p | grep proj

%r
options(HTTPUserAgent = sprintf(
  "R/%s R (%s)", 
  getRversion(), 
  paste(
    getRversion(), 
    R.version["platform"], 
    R.version["arch"], 
    R.version["os"]
  )
))

Sys.setenv(PROJ_LIB = "/usr/share/proj")

install.packages('units', lib='/databricks/spark/R/lib/',
  repos="https://cran.r-project.org")
install.packages('sf', 
  configure.args = "--with-proj-lib=/lib/x86_64-linux-gnu --with-proj-include=/usr/include",
  lib='/databricks/spark/R/lib/',
  repos="https://cran.r-project.org"
)

library(sf, lib.loc='/databricks/spark/R/lib/')
install.packages('prism', 
  configure.args = "--with-proj-lib=/lib/x86_64-linux-gnu --with-proj-include=/usr/include",
  lib='/databricks/spark/R/lib/',
  repos = c(CRAN = "https://packagemanager.rstudio.com/cran/__linux__/focal/latest")
)
library(prism, lib.loc='/databricks/spark/R/lib/')
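
Once the installs succeed, a quick sanity check (assuming sf is attached as above) is to ask sf which GDAL/GEOS/PROJ versions it actually linked against and confirm the PROJ data directory in effect:

# Sanity check: which geospatial libraries did sf link against?
sf::sf_extSoftVersion()   # GEOS, GDAL, and PROJ versions sf was built with
Sys.getenv("PROJ_LIB")    # PROJ data directory currently in effect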

 

Anyway! Thank you again for answering.
