
Arrow R package fails to install

apw
New Contributor II
# Databricks notebook source
## Current library search paths
.libPaths()
 
# COMMAND ----------
 
dir("/databricks/spark/R/lib")
 
# COMMAND ----------
 
## Add current working directory to library paths
.libPaths(c(getwd(), .libPaths()))
 
# COMMAND ----------
 
## The latest versions from CRAN
install.packages(c('arrow', 'tidyverse', 'aws.s3', 'sparklyr', 'cluster', 'sqldf', 'lubridate', 'ChannelAttribution'), repos = "http://cran.us.r-project.org")
 
# COMMAND ----------
 
dir("/tmp/Rserv/conn970")
 
# COMMAND ----------
 
## Copy from the session directory to the driver's site library
system("cp -R /tmp/Rserv/conn970 /usr/lib/R/site-library")
 
# COMMAND ----------
 
dir("/usr/lib/R/site-library")
 
# COMMAND ----------
 
## Copy from driver to DBFS
system("cp -R /tmp/Rserv/conn970 /dbfs/r-libraries")
 
# COMMAND ----------
 
dir("/dbfs/r-libraries")
# COMMAND ----------
 
# Point the library paths at the DBFS directory
.libPaths("/dbfs/r-libraries")
 
# COMMAND ----------
 
# Check that the dbfs libraries are in libPath
.libPaths()

About six weeks ago (early April 2022), I tested a workflow to ensure that I could trigger jobs on Databricks remotely from Airflow, which was successful.

As part of the process, the workflow starts a pre-built compute cluster and then loads various R libraries from DBFS onto it. One of the packages is 'arrow'; while all the other packages load without issue, this one fails and causes my workflow to crash.
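For context, the loading step on the cluster amounts to something like this (a minimal sketch, assuming the /dbfs/r-libraries path from my script above):

## Attach the DBFS library directory, then load the packages
.libPaths(c("/dbfs/r-libraries", .libPaths()))
library(arrow)      ## the package that fails
library(tidyverse)  ## the others load fine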

When I look into the workflow I get the following error: 'DRIVER_LIBRARY_INSTALLATION_FAILURE. Error Message: Command to install library [RCranPkgId(arrow,None,None)] on [0303-130414-840hkwxf] orgId [5132544506122561] failed inside Databricks infrastructure' (see the attached 'Arrow Fail Message' image). I have triggered the workflow directly inside of Databricks and still get the same problem, so clearly it does not have an Airflow-related cause.
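One workaround described in the arrow installation vignette, which I have not yet been able to verify on this cluster, is to download a prebuilt libarrow C++ binary instead of compiling it from source during installation (the compile step is a common cause of install failures on Linux):

## Sketch based on the arrow install docs: request a prebuilt C++ binary
Sys.setenv(NOT_CRAN = "true")        ## permit a binary libarrow download
Sys.setenv(LIBARROW_BINARY = "true") ## force the prebuilt binary
install.packages("arrow", repos = "https://cloud.r-project.org")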

I tried to delete the arrow package from DBFS to see if I could run the test without it, but every time I delete it, it returns when I retry the workflow.
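For reference, the deletion is just a directory removal on DBFS, along the same lines as the system() calls in my script:

## Remove the arrow library from DBFS and confirm it is gone
system("rm -rf /dbfs/r-libraries/arrow")
dir("/dbfs/r-libraries")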

I then checked CRAN to see if arrow had been updated recently; it had been, on 2022-05-09. So I loaded an older version instead (having first deleted everything relating to it from DBFS), but this didn't work either (see attached images).
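Pinning an older release can be done with remotes::install_version along these lines (a sketch; the 7.0.0 version number is only an example of a release preceding the 2022-05-09 update):

## Install a specific older arrow release from the CRAN archive
install.packages("remotes", repos = "https://cloud.r-project.org")
remotes::install_version("arrow", version = "7.0.0",
                         repos = "https://cloud.r-project.org")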

I have also attached the script I'm using in R to load the packages to DBFS. I believe it is sound, as every other package loads properly, but it may help in understanding what I am doing or why the error occurs.

What I'd like to know is:

  1. Do you see anything that I might be doing incorrectly inside my attached libraries script?
  2. Is there an issue with loading the arrow package, and if so, do you have a workaround that can prevent the failure?
  3. Why does DBFS continue to re-install arrow despite my removal of it from the directory?
  4. Can I permanently remove the arrow package from DBFS without it returning every time I trigger the workflow?

Many thanks in advance for any help you guys can offer.

1 REPLY

Atanu
Databricks Employee

@Anthony McGrath can you please download the package and upload it to DBFS again, and see if the issue still persists? You can also check whether any global init script is reinstalling this package on your cluster.
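A minimal sketch of how to list global init scripts from a notebook, assuming the workspace URL and a personal access token are available in DATABRICKS_HOST and DATABRICKS_TOKEN (the REST endpoint is /api/2.0/global-init-scripts):

## List global init scripts via the Databricks REST API
library(httr)
resp <- GET(
  paste0(Sys.getenv("DATABRICKS_HOST"), "/api/2.0/global-init-scripts"),
  add_headers(Authorization = paste("Bearer", Sys.getenv("DATABRICKS_TOKEN")))
)
str(content(resp))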
