collect() in SparkR and sparklyr

jlynlangford — Wed, 14 May 2025 20:19:50 GMT

Hello,

I'm have a vast difference in performance between SparkR:collect() and sparklyr:collect. I have a somewhat complicated query that uses WITH AS syntax to get the data set I need; there are several views defined and joins required. The final data set of this particular query is only about 2.5M rows. I am running this in an R notebook. sql(query) andsdf_sql(sc, query) take similar time to run, so I believe it is the collect method that is taking the longest amount of time.

~ 22 seconds to run:

%sql <query>

~ 2 minutes to run:

library(SparkR)

data_SparkR = SparkR::collect(sql(query))

~ 30 minutes to run:

library(sparklyr)

sc <- spark_connect(method = "databricks")

data_sparklyr = sparklyr::sdf_collect(sparklyr::sdf_sql(sc, query))

Can anyone help me understand why sparklyr is taking so much longer than SparkR? SparkR has been deprecated for more recent Databricks environments, so I can't simply switch to SparkR. Having to wait an extra 15x for my queries to run is quite cumbersome. Any insight would be greatly appreciated.

Thanks,

Jess

PS I'm using sparklyr 1.8.6, as the latest version (1.9.0) gives errors related to JAR files. Not sure if being able to use the most recent version of sparklyr would fix the performance issues.

Re: collect() in SparkR and sparklyr

niteshm — Wed, 14 May 2025 21:23:15 GMT

@jlynlangford

This is a tricky situation, and multiple resolutions can be tried to address the performance gap,

Schema Complexity: If the DataFrame contains nested structs, arrays, or map types, collect() can become significantly slower due to complex serialization. Flattening the schema before collecting can help reduce this overhead and improve performance.

Data Partitioning: Before using collect(), inspect how the data is partitioned using, sdf_num_partitions.If there are too many or skewed partitions, consider repartitioning the DataFrame.

Switch Interface: As an effective workaround, use %sql to save the result as a Delta table, and then read it into R using a CSV or Parquet connector. This method avoids the slower collection pipeline in sparklyr.

If any of these steps help improve performance, please share your results—and kudos to you for tackling a challenging optimization issue.

topic Re: collect() in SparkR and sparklyr in Data Engineering

collect() in SparkR and sparklyr

Re: collect() in SparkR and sparklyr