
R Package Installation Best Practices

araiho
New Contributor II

Hello,

We are new to Databricks and are wondering what the best practices are for R package installation. With our current init scripts, cluster spin-up takes more than 20 minutes. We have tried the following:

1. Libraries tab in the cluster preferences
2. Docker container
3. Init script shown below 

Any help would be appreciated. We haven't been able to start development because of these wait times. 

Ann

 

#!/bin/bash

# Update package list and install system dependencies
apt-get update -qq && apt-get install -y -qq \
    gdal-bin \
    libgdal-dev \
    libudunits2-dev \
    libproj-dev \
    libgeos-dev \
    r-cran-covr \
    r-cran-inline \
    r-cran-pkgkitten \
    r-cran-tinytest \
    r-cran-xml2 \
    r-cran-zoo

# Install R packages
R -e "install.packages('prism', repos='https://cloud.r-project.org/')"

# Install Python packages
/databricks/python3/bin/pip install cutadapt

# Install GDAL with pip
/databricks/python3/bin/pip install GDAL==3.2.2.1

# Print completion message
echo "Initialization script completed successfully."

araiho
New Contributor II

I wanted to add that with this script I cannot load the prism or sf packages. I think something is going on with the directories that GDAL and PROJ are installed to.
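A quick way to check where GDAL and PROJ are being picked up from is something like this (a minimal diagnostic sketch, assuming a Databricks R notebook):

%r
# where does the loader find the GDAL/PROJ shared libraries?
system("ldconfig -p | grep -E 'gdal|proj'")

# which PROJ data directory will be passed to PROJ?
Sys.getenv("PROJ_LIB")

# which library paths does R search, and is sf visible in any of them?
.libPaths()
find.package("sf", quiet = TRUE)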

Kaniz_Fatma

Hi @araiho! When it comes to R package installation in Databricks, there are a few best practices you can follow to optimize performance and reduce cluster spin-up times:

  1. First, explicitly set the installation directory to /databricks/spark/R/lib (see the sketch after this list).
  2. Build your custom R package either from the command line or using RStudio, copy the package file from your development machine to your Databricks workspace, and install it into a library by running install.packages().
  3. If you’re using Databricks Connect, ensure that you have the necessary prerequisites installed, for example: sparklyr, pysparklyr, reticulate, usethis, dplyr, and dbplyr.
  4. Make sure to select “Install dependencies” when setting up Databricks Connect.
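As a rough sketch of point 1, installing into /databricks/spark/R/lib makes a package visible to every notebook attached to the cluster (prism here is just the example package from this thread):

%r
# install into the cluster-wide Databricks R library rather than the session default
install.packages('prism',
  lib = '/databricks/spark/R/lib',
  repos = 'https://cloud.r-project.org')

# load it from that same location
library(prism, lib.loc = '/databricks/spark/R/lib')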

Thank you for the additional information!

It seems like there might be an issue with the installation directories for the gdal and proj packages. 

  1. Could you please re-check that the GDAL and PROJ libraries are installed in the correct directories?
  2. Ensure that the required dependencies for prism and sf are correctly installed. These packages often rely on other libraries; for example, sf depends on GDAL, PROJ, and units. Make sure these dependencies are available.
  3. Verify that your Databricks cluster is configured correctly. Sometimes cluster-specific settings can affect package installation; check whether there are any custom configurations or restrictions related to it.
  4. If you suspect issues with the default library paths, consider specifying a custom library path for your R packages (see the sketch after this list).
  5. After making any changes, restart your cluster so that the modifications take effect.
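For point 4, a minimal sketch of prepending a custom library path (using the /databricks/spark/R/lib directory mentioned above; any writable directory works the same way):

%r
# put the custom library first so install.packages() and library() search it by default
.libPaths(c('/databricks/spark/R/lib', .libPaths()))
.libPaths()  # confirm the new search order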

Let me know if you need further assistance. 

araiho
New Contributor II

@Kaniz_Fatma Thank you for your detailed response! I think we would like to use Docker if we can, because we are not using RStudio but R directly in Databricks notebooks and workflows. So any more information about R, Docker, and Databricks would also be useful. Currently, this Docker code builds and archives successfully but does not deploy on Databricks.

 

# syntax=docker/dockerfile:1.2

# Stage 1: Build R environment with Rocker
FROM --platform=linux/amd64 rocker/r-base:latest AS rbuilder

# Install required R packages in the Rocker image
RUN apt-get update && apt-get install -y \
    r-cran-dplyr \
    r-cran-ggplot2 \
    r-cran-tidyr \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Stage 2: Use Databricks image and copy R installation from Rocker
FROM --platform=linux/amd64 databricksruntime/standard:latest

# Copy R binaries and libraries from the Rocker image
COPY --from=rbuilder /usr/lib/R /usr/lib/R
COPY --from=rbuilder /usr/share/R /usr/share/R
COPY --from=rbuilder /etc/R /etc/R
COPY --from=rbuilder /usr/bin/R /usr/bin/R
COPY --from=rbuilder /usr/bin/Rscript /usr/bin/Rscript

# Ensure the R library paths are correctly set
ENV R_HOME=/usr/lib/R
ENV PATH=$PATH:/usr/lib/R/bin

# Copy R packages from the previous stage
COPY --from=rbuilder /usr/lib/R/site-library /usr/local/lib/R/site-library
COPY --from=rbuilder /usr/lib/x86_64-linux-gnu/ /usr/lib/x86_64-linux-gnu/
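One thing worth checking on the deployment side: custom images only run if Databricks Container Services is enabled for the workspace and the image URL is reachable from the cluster configuration; databricksruntime/standard, used as the base above, is the documented starting point. Once a container-backed cluster does start, a quick sanity check from a notebook (sketch) is:

%r
# verify which R the container actually exposes and what it can see
R.version.string                 # R version in use
.libPaths()                      # library search paths in effect
rownames(installed.packages())   # packages visible from those paths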

 



I have solved my dependency problem with the following notebook code, but I am a bit confused about why it works: PROJ_LIB has to be set to /usr/share/proj, yet the configure arguments for installing sf and prism have to point at /lib/x86_64-linux-gnu; and the repo for sf has to be https://cran.r-project.org, while for prism it can be https://packagemanager.rstudio.com/cran/__linux__/focal/latest. I would like to use the second repo as much as possible to install R packages because it is much faster than CRAN.

 

%r
# update apt and install the system libraries that sf and prism need
# (note: sudo is needed on the install step as well)
system('sudo apt-get -y update && sudo apt-get install -y libudunits2-dev libgdal-dev libgeos-dev libproj-dev')

%sh
# confirm the loader can see the GDAL/GEOS/PROJ shared libraries
ldconfig -p | grep gdal
ldconfig -p | grep geos
ldconfig -p | grep proj

%r
# identify this R build to the package repo so Posit Package Manager
# can serve prebuilt Linux binaries instead of source tarballs
options(HTTPUserAgent = sprintf(
  "R/%s R (%s)",
  getRversion(),
  paste(
    getRversion(),
    R.version["platform"],
    R.version["arch"],
    R.version["os"]
  )
))

# point PROJ at its data directory before building sf
Sys.setenv(PROJ_LIB = "/usr/share/proj")

# units first: sf needs it at install time
install.packages('units', lib='/databricks/spark/R/lib/',
  repos="https://cran.r-project.org")

# build sf against the system GDAL/PROJ installed above
install.packages('sf',
  configure.args = "--with-proj-lib=/lib/x86_64-linux-gnu --with-proj-include=/usr/include",
  lib='/databricks/spark/R/lib/',
  repos="https://cran.r-project.org"
)

library(sf, lib.loc='/databricks/spark/R/lib/')

# prism installs fine from the faster Package Manager binary repo
install.packages('prism',
  configure.args = "--with-proj-lib=/lib/x86_64-linux-gnu --with-proj-include=/usr/include",
  lib='/databricks/spark/R/lib/',
  repos = c(CRAN = "https://packagemanager.rstudio.com/cran/__linux__/focal/latest")
)
library(prism, lib.loc='/databricks/spark/R/lib/')
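To lean on the faster repo by default, one option (a sketch using standard R options) is to set it once per session, so later install.packages() calls use it without an explicit repos argument:

%r
# make the Package Manager binary repo the session default
options(repos = c(CRAN = 'https://packagemanager.rstudio.com/cran/__linux__/focal/latest'))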

 

Anyway! Thank you again for answering.
