
`collect()`ing Large Datasets in R

acsmaggart
New Contributor III

Background: I'm working on a pilot project to assess the pros and cons of using Databricks to train models in R. I am using a dataset that occupies about 5.7 GB of memory when loaded into a pandas dataframe. The data are stored in a Delta table in Unity Catalog.

Problem: I can `collect()` the data using Python (pyspark) in about 2 minutes. However, when I tried to use sparklyr to collect the same dataset in R, the command was still running after ~2.5 days. I can't load the dataset into DBFS first because we need stricter data-access controls than DBFS allows. Below are screenshots of the cells that I ran to `collect()` the data in Python and R.

I'm hoping that I'm just missing something about how sparklyr loads data.

Here is the cell that loads the data using pyspark; you can see that it took 2.04 minutes to complete: [screenshot: collecting the data using pyspark]

Here is the cell that loads the data using sparklyr; you can see that I cancelled it after 2.84 days: [screenshot: collecting the data using R]
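For anyone who can't see the screenshot, the sparklyr cell followed roughly this pattern (a minimal sketch; the connection method and the catalog/schema/table names below are placeholders, not my exact code):

```r
library(sparklyr)
library(dplyr)

# Attach to the notebook's existing Spark session on Databricks
sc <- spark_connect(method = "databricks")

# Lazy reference to the Unity Catalog table via SQL
# (catalog/schema/table names are placeholders, not the real ones)
tbl_ref <- sdf_sql(sc, "SELECT * FROM main.my_schema.my_table")

# Pull the full result to the driver as a local R data frame --
# this is the step that was still running after ~2.8 days
local_df <- collect(tbl_ref)
```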

I also tried using the `sparklyr::spark_read_table` function but I got an error that `Table or view not found: main.databricks_...` which I think must be because the table is in a metastore managed by Unity Catalog.
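For reference, that attempt looked roughly like the sketch below (table name is a placeholder). My understanding is that `spark_read_table()` resolves a bare name against the session's current catalog/database, which may be why the Unity Catalog table isn't found:

```r
library(sparklyr)

sc <- spark_connect(method = "databricks")

# Placeholder name -- the real table lives in a Unity Catalog metastore
df_spark <- spark_read_table(sc, "my_table")

# Untested idea: pass the fully qualified three-level name instead, e.g.
# df_spark <- spark_read_table(sc, "main.my_schema.my_table")
```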

Environment Info:

Databricks Runtime: 10.4 LTS

Driver Node Size: 140GB memory and 20 cores

Worker Nodes: 1 worker node with 56GB of memory and 8 cores.

R libraries installed: arrow, sparklyr, SparkR, dplyr

6 REPLIES

Anonymous
Not applicable

If you have 5 GB of data, you don't need Spark; just use your laptop. Spark is built for scale and won't perform well on small datasets because of all the overhead that distribution requires.

Also, don't name a pandas dataframe `df_spark`; name it `something_pdf` instead.

acsmaggart
New Contributor III

Yes, the name of the dataframe was a little sloppy since it's a pandas dataframe. Regarding scale, though: all of the machine learning documentation and sample ML notebooks for Databricks that I have seen load the dataset into memory on the driver node, and if I remember right, the guidance I read from Databricks was to avoid a Spark-compatible training algorithm as long as your data fits into memory on the driver node. So while a 5 GB dataset could fit on my laptop, I'm worried that if I can't load 5 GB from a Delta table onto the driver node, I almost certainly won't be able to load a larger dataset that wouldn't fit on my laptop, say 50 GB. Plus, the dataset contains protected health information, which I'm not permitted to download onto my laptop anyway.

User16781341549
New Contributor II

Have you tried performing `collect()` with SparkR? That would require loading the data as a SparkR DataFrame.

That is a good suggestion, and something I probably should have tried already. However, when I use `SparkR::collect` I get a JVM error:

`java.lang.OutOfMemoryError: Requested array size exceeds VM limit`
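For completeness, the SparkR attempt was along these lines (table name is a placeholder). `collect()` pulls the whole result back to the driver in one go, which is presumably where the JVM array-size limit is being hit:

```r
library(SparkR)

# On Databricks the SparkR session already exists; sparkR.session() attaches to it
sparkR.session()

# Placeholder three-level Unity Catalog table name
sdf <- sql("SELECT * FROM main.my_schema.my_table")

# Materialize everything on the driver as a base R data.frame --
# this is the call that threw the OutOfMemoryError above
local_df <- collect(sdf)
```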

Anonymous
Not applicable

Hi @Max Taggart

Hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

Annapurna_Hiriy
New Contributor III

@acsmaggart Please try using `collect_larger()` to collect the large dataset; this should work. Please refer to the following post for more info on the library.
https://medium.com/@NotZacDavies/collecting-large-results-with-sparklyr-8256a0370ec6
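If the library itself isn't an option, one way to approximate the same goal is to collect the table in chunks rather than in a single `collect()` call. The sketch below is not `collect_larger()` itself (see the linked post for that); all names are illustrative:

```r
library(sparklyr)
library(dplyr)

# Illustrative sketch: add a row id on the Spark side, collect a window of
# rows at a time, and bind the pieces into one local data frame.
collect_in_chunks <- function(sdf, chunk_size = 1e6) {
  sdf    <- sdf_with_sequential_id(sdf, id = "row_id")          # add 1..N row id
  n_rows <- sdf %>% summarise(n = max(row_id, na.rm = TRUE)) %>% pull(n)
  starts <- seq(1, n_rows, by = chunk_size)

  pieces <- lapply(starts, function(s) {
    sdf %>%
      filter(row_id >= s, row_id < s + chunk_size) %>%
      select(-row_id) %>%
      collect()
  })
  bind_rows(pieces)
}

# e.g. local_df <- collect_in_chunks(sdf_sql(sc, "SELECT * FROM main.my_schema.my_table"))
```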
