Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

collect() in SparkR and sparklyr

jlynlangford
New Contributor

Hello,

I'm seeing a vast difference in performance between SparkR::collect() and sparklyr::collect(). I have a somewhat complicated query that uses WITH ... AS syntax to get the data set I need; several views are defined and a number of joins are required. The final data set from this particular query is only about 2.5M rows. I am running this in an R notebook. sql(query) and sdf_sql(sc, query) take a similar amount of time to run, so I believe it is the collect step that is taking the longest (see the timing sketch after the examples below).

 
~ 22 seconds to run:
%sql <query>
 
~ 2 minutes to run:
library(SparkR)
data_SparkR = SparkR::collect(sql(query))
 
~ 30 minutes to run:
library(sparklyr)
sc <- spark_connect(method = "databricks")
data_sparklyr = sparklyr::sdf_collect(sparklyr::sdf_sql(sc, query))
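 
For reference, a minimal sketch of how the SQL step and the collect step can be timed separately (base R's system.time; sc and query are the same objects as above):
 
# Build the Spark DataFrame, then pull its rows into the R session,
# timing each step on its own.
timing_sql <- system.time(sdf <- sparklyr::sdf_sql(sc, query))
timing_collect <- system.time(data_sparklyr <- sparklyr::sdf_collect(sdf))
timing_sql
timing_collect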
 
Can anyone help me understand why sparklyr is taking so much longer than SparkR? SparkR has been deprecated in more recent Databricks Runtime versions, so I can't simply fall back to SparkR. Having to wait roughly 15x longer for my queries to run is quite cumbersome. Any insight would be greatly appreciated.
 
Thanks,
Jess
 
PS I'm using sparklyr 1.8.6, as the latest version (1.9.0) gives errors related to JAR files. I'm not sure whether being able to use the most recent version of sparklyr would fix the performance issues.
1 REPLY

niteshm
New Contributor III
@jlynlangford 

This is a tricky situation, and there are a few things you can try to address the performance gap:

Schema Complexity: If the DataFrame contains nested structs, arrays, or map types, collect() can become significantly slower due to complex serialization. Flattening the schema before collecting can help reduce this overhead and improve performance.
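
A minimal sketch of the flattening idea in sparklyr, assuming the sc and query objects from the original post; the customer struct and order_total column are hypothetical placeholders for whatever nested fields your schema actually contains:

library(sparklyr)

# Inspect the result schema to spot struct, array, or map columns.
result_sdf <- sdf_sql(sc, query)
sdf_schema(result_sdf)

# Register the result as a temp view and project nested fields into flat,
# scalar columns before collecting, so rows serialize as simple types.
sdf_register(result_sdf, "query_result")
flat_sdf <- sdf_sql(sc, "
  SELECT customer.id   AS customer_id,
         customer.name AS customer_name,
         order_total
  FROM   query_result
")
data_flat <- sdf_collect(flat_sdf)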

Data Partitioning: Before using collect(), inspect how the data is partitioned with sdf_num_partitions(). If there are too many partitions, or the partitions are skewed, consider repartitioning the DataFrame.
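
A sketch of the partition check and repartition, again assuming the sc and query objects from the original post (the partition count of 16 is only an illustrative value):

library(sparklyr)

result_sdf <- sdf_sql(sc, query)

# How many partitions does the query result have?
sdf_num_partitions(result_sdf)

# If the count is very high or the data is skewed, bring it down to a modest
# number before collecting.
result_sdf <- sdf_repartition(result_sdf, partitions = 16)
data_local <- sdf_collect(result_sdf)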

Switch Interface: As an effective workaround, use %sql to save the result as a Delta table, and then read it into R using a CSV or Parquet connector. This method avoids the slower collection pipeline in sparklyr.
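
One variation of this workaround, sketched under the assumption that the arrow package is available on the cluster and the /dbfs FUSE mount is accessible; it writes plain Parquet rather than a Delta table so arrow can read it directly, and dbfs:/tmp/query_result is only a placeholder path:

library(sparklyr)
library(arrow)
library(dplyr)

# Run the heavy query once in Spark and materialize the result as Parquet.
result_sdf <- sdf_sql(sc, query)
spark_write_parquet(result_sdf, path = "dbfs:/tmp/query_result", mode = "overwrite")

# Read the Parquet files back into an R data frame via the /dbfs FUSE mount
# with arrow, bypassing sparklyr's collection path entirely.
data_local <- open_dataset("/dbfs/tmp/query_result") %>% collect()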

If any of these steps help improve performance, please share your results. Kudos to you for tackling a challenging optimization issue.
