
Performance issues when loading an Excel file from DBFS using R

mbaumga
New Contributor III

I have uploaded small Excel files to my DBFS. I then use the read_xlsx() function from the "readxl" package in R to import a file into R memory. I use a standard cluster (12.1, non-ML). The function works, but it takes ages: a simple Excel table with 40,000+ records and 5 columns takes 9 minutes. On my local R installation on Windows, the load is instantaneous. "readxl" is considered the best package for dealing with Excel files, and it is among the libraries already available on the cluster. Any idea what might cause this?

4 REPLIES 4

Kaniz
Community Manager

Hi @Marcel Baumgartner, it is unusual for the read_xlsx() function from the readxl package to take so long to read a relatively small Excel file. There could be several reasons for this performance issue, including cluster configuration, available resources, or the underlying file storage.

Here are some suggestions to diagnose and potentially resolve the performance issue:

  • Cluster resources: Check if your cluster has enough resources (CPU, memory) to handle the operation. If your cluster is under high load or has limited resources, it might be causing the performance issue. You can monitor the cluster's resource usage in the Databricks UI under the Clusters tab.
  • DBFS latency: DBFS might have higher latency than your local file system, causing the slow read times. You can test this hypothesis by copying the file to the local file system of the driver node and then reading it with read_xlsx().

Use the following command to copy the file from DBFS to the local file system:

%fs cp dbfs:/path/to/your/excel_file /tmp/local_excel_file

Then, in your R code, read the local file:

library(readxl)
df <- read_xlsx("/tmp/local_excel_file")
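The same copy-then-read pattern can also be done from Python, since Databricks exposes DBFS on the driver node through the /dbfs FUSE mount. A minimal sketch (the DBFS path in the comment is a placeholder; the runnable part below uses a stand-in file so the snippet works anywhere):

```python
import os
import shutil

# On a Databricks driver node, DBFS is mounted at /dbfs, so a plain
# file copy moves data onto fast local disk (hypothetical paths):
# shutil.copy("/dbfs/path/to/your/excel_file", "/tmp/local_excel_file")

# Stand-in demonstration of the same copy step:
os.makedirs("/tmp/demo_src", exist_ok=True)
with open("/tmp/demo_src/excel_file", "w") as f:
    f.write("stand-in")  # pretend this is the Excel file
shutil.copy("/tmp/demo_src/excel_file", "/tmp/local_excel_file")
```

After the copy, read_xlsx("/tmp/local_excel_file") in R operates on local disk rather than going through DBFS on every read.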

  • Alternative libraries: If the issue persists, you can try using an alternative library, such as readxlsb or openxlsx, to read the Excel file. This can help identify if the problem is specific to the readxl package or a more general issue. Keep in mind that you might need to install these libraries if they are not already available on your cluster.

  • Convert to CSV: As a last resort, you can convert the Excel file to CSV format, which is generally faster to read in R. You can do this conversion either locally on your computer or programmatically using Python's pandas library. Once the file is in CSV format, use R's read.csv() or the readr package's read_csv() function to read the file, which should be significantly faster.
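A minimal sketch of the pandas conversion route, assuming pandas is available (the Excel paths in the comment are placeholders, and reading .xlsx with pandas also requires an engine such as openpyxl; the runnable part uses stand-in data):

```python
import pandas as pd

# One-off conversion of the real file (hypothetical paths):
# pd.read_excel("/dbfs/path/to/file.xlsx").to_csv("/dbfs/path/to/file.csv", index=False)

# Stand-in demonstration of the CSV round trip:
df = pd.DataFrame({"id": range(5), "value": ["a", "b", "c", "d", "e"]})
df.to_csv("/tmp/demo.csv", index=False)
back = pd.read_csv("/tmp/demo.csv")
print(back.shape)  # (5, 2)
```

In R, the resulting CSV can then be loaded with read.csv("/dbfs/path/to/file.csv") or readr::read_csv(), both of which are typically much faster than parsing xlsx.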

Please note that R in Databricks runs inside a container, and its performance might be different from your local R installation. Also, keep in mind that the performance of the R environment in Databricks might not be as optimized as the PySpark environment, which is native to Databricks. If the performance issue persists, consider using PySpark to read and process the data.

mbaumga
New Contributor III

Dear Kaniz, thanks. Your suggestion to copy the Excel file from DBFS to a temp folder (e.g. "/tmp/") worked. Once I run read_xlsx() on the local copy, the load is instantaneous.

Indeed, we had the same problem the other way around: I was not able to save an xlsx or PDF file directly into a folder under "/dbfs/FileStore/*". However, I can save it to "/tmp/*" and then copy it to where I want the file to be. Good to know!

Thanks again!

Anonymous
Not applicable

Hi @Marcel Baumgartner​ 

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 

mbaumga
New Contributor III

The question was answered successfully. All okay!
