<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: `collect()`ing Large Datasets in R in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/collect-ing-large-datasets-in-r/m-p/24679#M1381</link>
    <description>&lt;P&gt;Yes, the name of the dataframe was a little sloppy, since it's a pandas dataframe. As for scale, though, all of the machine learning documentation and sample ML notebooks for Databricks that I have seen load the dataset into memory on the driver node. And if I remember right, the guidance I read from Databricks was to avoid using a Spark-compatible training algorithm as long as your data can fit into memory on the driver node. So while a 5 GB dataset could fit on my laptop, I'm a little worried that if I can't load 5 GB from a Delta table onto the driver node, I almost certainly won't be able to load a larger dataset that wouldn't fit on my laptop, say 50 GB. Plus, the dataset contains protected health information, which I'm not permitted to download onto my laptop anyway.&lt;/P&gt;</description>
    <pubDate>Tue, 01 Nov 2022 16:06:21 GMT</pubDate>
    <dc:creator>acsmaggart</dc:creator>
    <dc:date>2022-11-01T16:06:21Z</dc:date>
    <item>
      <title>`collect()`ing Large Datasets in R</title>
      <link>https://community.databricks.com/t5/machine-learning/collect-ing-large-datasets-in-r/m-p/24675#M1377</link>
      <description>&lt;P&gt;Background: I'm working on a pilot project to assess the pros and cons of using Databricks to train models in R. I am using a dataset that occupies about 5.7 GB of memory when loaded into a pandas dataframe. The data are stored in a Delta table in Unity Catalog.&lt;/P&gt;&lt;P&gt;Problem: I can `collect()` the data using Python (PySpark) in about 2 minutes. However, when I tried to use sparklyr to collect the same dataset in R, the command was still running after ~2.5 days. I can't load the dataset into DBFS first because we need stricter data-access controls than DBFS will allow. Below are screenshots of the cells that I ran to `collect()` the data in Python and R.&lt;/P&gt;&lt;P&gt;I'm hoping that I'm just missing something about how sparklyr loads data.&lt;/P&gt;&lt;P&gt;Here is the cell that loads the data using PySpark. You can see that it took 2.04 minutes to complete:&lt;span class="lia-inline-image-display-wrapper" image-alt="collecting the data using pyspark"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1275iBB9E68885C67BC77/image-size/large?v=v2&amp;amp;px=999" role="button" title="collecting the data using pyspark" alt="collecting the data using pyspark" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Here is the cell that loads the data using sparklyr. You can see that I cancelled it after 2.84 days:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="collecting the data using R"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1267iB138AA38561D0359/image-size/large?v=v2&amp;amp;px=999" role="button" title="collecting the data using R" alt="collecting the data using R" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I also tried using the `sparklyr::spark_read_table` function, but I got the error `Table or view not found: main.databricks_...`, which I think must be because the table is in a metastore managed by Unity Catalog.&lt;/P&gt;&lt;P&gt;Environment Info:&lt;/P&gt;&lt;P&gt;Databricks Runtime: 10.4 LTS&lt;/P&gt;&lt;P&gt;Driver Node Size: 140 GB of memory and 20 cores&lt;/P&gt;&lt;P&gt;Worker Nodes: 1 worker node with 56 GB of memory and 8 cores&lt;/P&gt;&lt;P&gt;R libraries installed: arrow, sparklyr, SparkR, dplyr&lt;/P&gt;</description>
      <pubDate>Mon, 31 Oct 2022 18:37:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/collect-ing-large-datasets-in-r/m-p/24675#M1377</guid>
      <dc:creator>acsmaggart</dc:creator>
      <dc:date>2022-10-31T18:37:06Z</dc:date>
    </item>
    <item>
      <title>Re: `collect()`ing Large Datasets in R</title>
      <link>https://community.databricks.com/t5/machine-learning/collect-ing-large-datasets-in-r/m-p/24676#M1378</link>
      <description>&lt;P&gt;If you have 5 GB of data, you don't need Spark. Just use your laptop. Spark is built for scale and won't perform well on small datasets because of all the overhead that distributed processing requires.&lt;/P&gt;&lt;P&gt;Also, don't name a pandas dataframe df_spark_. Just name it something_pdf.&lt;/P&gt;</description>
      <pubDate>Mon, 31 Oct 2022 20:52:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/collect-ing-large-datasets-in-r/m-p/24676#M1378</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-10-31T20:52:27Z</dc:date>
    </item>
    <item>
      <title>Re: `collect()`ing Large Datasets in R</title>
      <link>https://community.databricks.com/t5/machine-learning/collect-ing-large-datasets-in-r/m-p/24677#M1379</link>
      <description>&lt;P&gt;Have you tried performing `collect()` with SparkR? That would require loading the data as a SparkR DataFrame.&lt;/P&gt;</description>
      <pubDate>Tue, 01 Nov 2022 06:18:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/collect-ing-large-datasets-in-r/m-p/24677#M1379</guid>
      <dc:creator>User16781341549</dc:creator>
      <dc:date>2022-11-01T06:18:00Z</dc:date>
    </item>
    <item>
      <title>Re: `collect()`ing Large Datasets in R</title>
      <link>https://community.databricks.com/t5/machine-learning/collect-ing-large-datasets-in-r/m-p/24678#M1380</link>
      <description>&lt;P&gt;That is a good suggestion, and something I probably should have tried already. However, when I use&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;SparkR::collect&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I get a JVM error:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;java.lang.OutOfMemoryError: Requested array size exceeds VM limit&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 01 Nov 2022 16:00:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/collect-ing-large-datasets-in-r/m-p/24678#M1380</guid>
      <dc:creator>acsmaggart</dc:creator>
      <dc:date>2022-11-01T16:00:59Z</dc:date>
    </item>
    <item>
      <title>Re: `collect()`ing Large Datasets in R</title>
      <link>https://community.databricks.com/t5/machine-learning/collect-ing-large-datasets-in-r/m-p/24679#M1381</link>
      <description>&lt;P&gt;Yes, the name of the dataframe was a little sloppy, since it's a pandas dataframe. As for scale, though, all of the machine learning documentation and sample ML notebooks for Databricks that I have seen load the dataset into memory on the driver node. And if I remember right, the guidance I read from Databricks was to avoid using a Spark-compatible training algorithm as long as your data can fit into memory on the driver node. So while a 5 GB dataset could fit on my laptop, I'm a little worried that if I can't load 5 GB from a Delta table onto the driver node, I almost certainly won't be able to load a larger dataset that wouldn't fit on my laptop, say 50 GB. Plus, the dataset contains protected health information, which I'm not permitted to download onto my laptop anyway.&lt;/P&gt;</description>
      <pubDate>Tue, 01 Nov 2022 16:06:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/collect-ing-large-datasets-in-r/m-p/24679#M1381</guid>
      <dc:creator>acsmaggart</dc:creator>
      <dc:date>2022-11-01T16:06:21Z</dc:date>
    </item>
    <item>
      <title>Re: `collect()`ing Large Datasets in R</title>
      <link>https://community.databricks.com/t5/machine-learning/collect-ing-large-datasets-in-r/m-p/24680#M1382</link>
      <description>&lt;P&gt;Hi @Max Taggart,&lt;/P&gt;&lt;P&gt;Hope all is well! Just checking in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.&lt;/P&gt;&lt;P&gt;We'd love to hear from you.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Wed, 11 Jan 2023 07:18:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/collect-ing-large-datasets-in-r/m-p/24680#M1382</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-01-11T07:18:56Z</dc:date>
    </item>
    <item>
      <title>Re: `collect()`ing Large Datasets in R</title>
      <link>https://community.databricks.com/t5/machine-learning/collect-ing-large-datasets-in-r/m-p/60024#M2996</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/53246"&gt;@acsmaggart&lt;/a&gt;&amp;nbsp;Please try using collect_larger() to collect the larger dataset; this should work. Please refer to the following article for more information on the library:&lt;BR /&gt;&lt;A href="https://medium.com/@NotZacDavies/collecting-large-results-with-sparklyr-8256a0370ec6" target="_blank"&gt;https://medium.com/@NotZacDavies/collecting-large-results-with-sparklyr-8256a0370ec6&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 13 Feb 2024 11:17:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/collect-ing-large-datasets-in-r/m-p/60024#M2996</guid>
      <dc:creator>Annapurna_Hiriy</dc:creator>
      <dc:date>2024-02-13T11:17:37Z</dc:date>
    </item>
  </channel>
</rss>

