03-02-2023 08:00 PM
I'm trying to use a Dockerfile to create a cluster that has Robyn (https://facebookexperimental.github.io/Robyn/) and other R libraries installed, but the R libraries are not being installed on the cluster. When I run the container in interactive mode, I can see the R libraries.
How can I use a Dockerfile to create a cluster with these R libraries installed?
Thank you
03-05-2023 10:24 PM
Hi, the error suggests that one package cannot be located. Could you please verify that the package name and the address of the package are valid?
Also, please tag @Debayan Mukherjee in your next response so that I am notified. Thank you!
03-06-2023 05:08 AM
@Debayan Mukherjee Thank you for your reply. The package name and the address are valid. I can see the package version when I run the container in interactive mode, but none of the R packages are installed on the cluster when I use the Docker image to create it. Am I missing some code in the Dockerfile?
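One way to narrow this down is to have the image report where R installs packages and then compare that with what the cluster sees. The fragment below is a minimal sketch, assuming it is appended to a Dockerfile in which R and Robyn are already installed by earlier steps. It is a diagnostic, not a confirmed fix for this thread.

```dockerfile
# Hypothetical diagnostic step, appended after the R package installs.
# Prints the library search path used at build time, then the directory
# Robyn was installed into. find.package() raises an error if the package
# is absent, so the build stops here instead of producing a broken image.
RUN Rscript -e "print(.libPaths()); print(find.package('Robyn'))"
```

Running `print(.libPaths())` in an `%r` notebook cell on the cluster and comparing the two outputs should show whether the cluster's R session searches the directories the image installed into.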
03-08-2023 08:37 AM
@Debayan Mukherjee Hi, wanted to follow up on this. Please let me know if you need any more information from my side.
03-08-2023 10:06 PM
Hi @Navneet Sonak, sorry for the delay! We would also like to know how the Docker image was created, since something may be missing from the image's Dockerfile. Also, does it work with the default DBR cluster?
03-09-2023 06:18 AM
Hi @Debayan Mukherjee, the Docker image is created using an Argo workflow. I used this Dockerfile as a reference: https://github.com/databricks/containers/blob/master/ubuntu/R/Dockerfile. I'm not sure I follow your second question. The cluster is created fine; the problem is that it is missing all the R packages that the Dockerfile should have installed.
Here's my dockerfile code:
FROM databricksruntime/standard:10.4-LTS
# Suppress interactive configuration prompts
ENV DEBIAN_FRONTEND=noninteractive
ENV DOWNLOAD_STATIC_LIBV8=1
ENV TZ=America/New_York
# install dependencies
RUN apt-get update \
&& apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9 \
&& add-apt-repository -y 'deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/' \
&& apt-get install build-essential --yes \
dirmngr gnupg apt-transport-https ca-certificates software-properties-common \
autoconf \
automake \
g++ \
gcc \
cmake \
gfortran \
make \
nano \
liblapack-dev \
liblapack3 \
libopenblas-base \
libopenblas-dev \
libcurl4-openssl-dev \
libxml2-dev \
libssl-dev \
libnlopt-dev \
r-base \
r-base-dev \
&& apt-get clean all \
&& rm -rf /var/lib/apt/lists/*
RUN R -e "install.packages(c('remotes', 'shiny'), repos='https://cran.microsoft.com/')"
#RUN R -e "remotes::install_github('facebookexperimental/Robyn/R');"
RUN R -e "install.packages('Robyn')"
RUN R -e "library(Robyn)"
# # DBI/ODBC dependencies
RUN R -e "install.packages(c('DBI', 'dplyr','dbplyr','odbc'), repos='https://cran.microsoft.com/')"
# # Databricks dependencies
# # hwriterPlus is used by Databricks to display output in notebook cells
# # Rserve allows Spark to communicate with a local R process to run R code
RUN R -e "install.packages(c('hwriterPlus'), repos='https://mran.revolutionanalytics.com/snapshot/2017-02-26')"
RUN R -e "install.packages(c('htmltools'), repos='https://cran.microsoft.com/')"
RUN R -e "install.packages('Rserve', repos='http://rforge.net/')"
RUN R -e "install.packages('reticulate');"
RUN R -e "library(reticulate)"
# ## Install Nevergrad
# # RUN R -e "reticulate::use_python('/opt/conda/bin/python3')"
# # RUN R -e "reticulate::py_config()"
# # RUN R -e "reticulate::py_install('nevergrad', pip = TRUE)"
RUN /databricks/python3/bin/pip install nevergrad
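Note that `install.packages()` reports a failed install as a warning rather than an error, so a `RUN R -e "install.packages(...)"` step can succeed even when nothing was installed. A minimal sketch of a final check that fails the build if any expected package is missing is shown below. The package list is an assumption taken from the names in the Dockerfile above, and the fragment assumes R is already installed by the earlier steps.

```dockerfile
# Hypothetical final check, appended after all install.packages() steps.
# The package list mirrors names used in the Dockerfile above; adjust as needed.
# stop() makes Rscript exit with a non-zero status, which fails the docker build.
RUN Rscript -e "pkgs <- c('Robyn', 'reticulate', 'Rserve', 'hwriterPlus', 'htmltools', 'shiny'); absent <- setdiff(pkgs, rownames(installed.packages())); if (length(absent) > 0) stop('Missing R packages: ', paste(absent, collapse = ', '))"
```

If this step passes at build time but the packages are still not visible on the cluster, the image contents are probably fine and the difference lies in how the cluster's R session is set up.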
03-12-2023 11:03 PM
We need to review the Dockerfile code before proceeding. It would be helpful if you created a support case for this, which will ensure it gets routed to the right team.
Also, are there any dependency package failures?
03-31-2023 12:33 AM
Hi @Navneet Sonak
Hope you are doing well.
Thank you for posting your question in our community! We are happy to assist you.
To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?
This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!
Cheers!
06-01-2023 10:32 AM
What? There has been no answer here! @Debayan Mukherjee @Vartika Nain
I am running into this same problem. Having to wait 45 minutes for libraries to install is absolutely wild, and I have tried everything short of working with the Docker container.
FROM databricksruntime/standard:9.x
# based on these instructions (avoiding firewall issue for some users):
# https://cran.rstudio.com/bin/linux/ubuntu/#secure-apt
RUN apt-get update \
&& DEBIAN_FRONTEND="noninteractive" apt-get install --yes software-properties-common apt-transport-https \
&& gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9 \
&& gpg -a --export E298A3A825C0D65DFD57CBB651716619E084DAB9 | sudo apt-key add - \
&& add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/' \
&& apt-get update \
&& DEBIAN_FRONTEND="noninteractive" apt-get install --yes \
libssl-dev \
r-base \
r-base-dev \
&& add-apt-repository -r 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/' \
&& apt-key del E298A3A825C0D65DFD57CBB651716619E084DAB9 \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
# UPDATE A SERIES OF PACKAGES
# RUN apt-get update --fix-missing && apt-get install -y ca-certificates libglib2.0-0 libxext6 libsm6 libxrender1 libxml2-dev
# hwriterPlus is used by Databricks to display output in notebook cells
# Rserve allows Spark to communicate with a local R process to run R code
# shiny is used by Databricks interpreter
RUN R -e "install.packages(c('hwriter', 'TeachingDemos', 'htmltools'))"
RUN R -e "install.packages('https://cran.r-project.org/src/contrib/Archive/hwriterPlus/hwriterPlus_1.0-3.tar.gz', repos=NULL, type='source')"
RUN R -e "install.packages('Rserve', repos='http://rforge.net/', type='source')"
RUN R -e "install.packages('shiny', repos='https://cran.rstudio.com/')"
# Added packages for the project that I am currently working on
RUN R -e "install.packages(c('sparklyr', 'remotes', 'plyr', 'dplyr', 'rlist', 'stringr', 'ggplot2', 'patchwork', 'scales', 'Robyn', 'reticulate'))"
# Install nevergrad Python package
RUN python3 -m pip install nevergrad
RUN R -e "library(reticulate); reticulate::py_config()"
RUN R -e "install.packages('devtools', repos='https://cran.rstudio.com/')"
RUN R -e "remotes::install_github('mlflow/mlflow', subdir = 'mlflow/R/mlflow')"
I went with the runtime image because I have a use case for MLflow. I get hit by the Stan issues as well as issues installing mlflow.
It is very clear that R isn't well supported in Databricks: there was a resolved issue whose fix was never merged into main, and the last update was 10 months ago.
@Navneet Sonak, let me know if you end up solving this with the Docker image. I would be super grateful.
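Both Dockerfiles in this thread install nevergrad with pip and then rely on reticulate to find it. reticulate may select a different Python interpreter than the one pip installed into, so it can help to pin the interpreter and verify the import at build time. The fragment below is a sketch under stated assumptions: the `/databricks/python3` path is inferred from the `pip` call in the first Dockerfile, and whether the cluster's R session inherits the `ENV` setting has not been confirmed here.

```dockerfile
# Assumption: the Databricks Python environment lives under /databricks/python3,
# as implied by the /databricks/python3/bin/pip call earlier in this thread.
# RETICULATE_PYTHON tells reticulate which interpreter to use.
ENV RETICULATE_PYTHON=/databricks/python3/bin/python
RUN /databricks/python3/bin/pip install nevergrad
# Fail the build if reticulate cannot import nevergrad from that interpreter.
RUN Rscript -e "stopifnot(reticulate::py_module_available('nevergrad'))"
```

This assumes the reticulate R package is already installed by an earlier step.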