Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

collect() in SparkR and sparklyr

jlynlangford
New Contributor

Hello,

I'm seeing a vast difference in performance between SparkR::collect() and sparklyr::collect(). I have a somewhat complicated query that uses WITH ... AS syntax to get the data set I need; several views are defined and a number of joins are required. The final data set from this particular query is only about 2.5M rows. I am running this in an R notebook. sql(query) and sdf_sql(sc, query) take a similar amount of time to run, so I believe it is the collect step that is taking the longest (see the timing sketch after the examples below).

 
~ 22 seconds to run:
%sql <query>
 
~ 2 minutes to run:
library(SparkR)
data_SparkR = SparkR::collect(sql(query))
 
~ 30 minutes to run:
library(sparklyr)
sc <- spark_connect(method = "databricks")
data_sparklyr = sparklyr::sdf_collect(sparklyr::sdf_sql(sc, query))
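 
For reference, a minimal sketch of how the SQL step and the collect step can be timed separately (base R's system.time; sc and query are the same objects as above):
 
# Build the Spark DataFrame, then pull its rows into the R session,
# timing each step on its own.
timing_sql <- system.time(sdf <- sparklyr::sdf_sql(sc, query))
timing_collect <- system.time(data_sparklyr <- sparklyr::sdf_collect(sdf))
timing_sql
timing_collect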
 
Can anyone help me understand why sparklyr is taking so much longer than SparkR? SparkR has been deprecated in more recent Databricks Runtime versions, so I can't simply fall back to SparkR. Having to wait roughly 15x longer for my queries to run is quite cumbersome. Any insight would be greatly appreciated.
 
Thanks,
Jess
 
PS I'm using sparklyr 1.8.6, as the latest version (1.9.0) gives errors related to JAR files. I'm not sure whether being able to use the most recent version of sparklyr would fix the performance issues.
1 REPLY

niteshm
New Contributor III
@jlynlangford 

This is a tricky situation, and there are a few things you can try to address the performance gap:

Schema Complexity: If the DataFrame contains nested structs, arrays, or map types, collect() can become significantly slower due to complex serialization. Flattening the schema before collecting can help reduce this overhead and improve performance.
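
A minimal sketch of the flattening idea in sparklyr, assuming the sc and query objects from the original post; the customer struct and order_total column are hypothetical placeholders for whatever nested fields your schema actually contains:

library(sparklyr)

# Inspect the result schema to spot struct, array, or map columns.
result_sdf <- sdf_sql(sc, query)
sdf_schema(result_sdf)

# Register the result as a temp view and project nested fields into flat,
# scalar columns before collecting, so rows serialize as simple types.
sdf_register(result_sdf, "query_result")
flat_sdf <- sdf_sql(sc, "
  SELECT customer.id   AS customer_id,
         customer.name AS customer_name,
         order_total
  FROM   query_result
")
data_flat <- sdf_collect(flat_sdf)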

Data Partitioning: Before using collect(), inspect how the data is partitioned with sdf_num_partitions(). If there are too many partitions, or the partitions are skewed, consider repartitioning the DataFrame.
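
A sketch of the partition check and repartition, again assuming the sc and query objects from the original post (the partition count of 16 is only an illustrative value):

library(sparklyr)

result_sdf <- sdf_sql(sc, query)

# How many partitions does the query result have?
sdf_num_partitions(result_sdf)

# If the count is very high or the data is skewed, bring it down to a modest
# number before collecting.
result_sdf <- sdf_repartition(result_sdf, partitions = 16)
data_local <- sdf_collect(result_sdf)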

Switch Interface: As an effective workaround, use %sql to save the result as a Delta table, and then read it into R using a CSV or Parquet connector. This method avoids the slower collection pipeline in sparklyr.
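
One variation of this workaround, sketched under the assumption that the arrow package is available on the cluster and the /dbfs FUSE mount is accessible; it writes plain Parquet rather than a Delta table so arrow can read it directly, and dbfs:/tmp/query_result is only a placeholder path:

library(sparklyr)
library(arrow)
library(dplyr)

# Run the heavy query once in Spark and materialize the result as Parquet.
result_sdf <- sdf_sql(sc, query)
spark_write_parquet(result_sdf, path = "dbfs:/tmp/query_result", mode = "overwrite")

# Read the Parquet files back into an R data frame via the /dbfs FUSE mount
# with arrow, bypassing sparklyr's collection path entirely.
data_local <- open_dataset("/dbfs/tmp/query_result") %>% collect()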

If any of these steps help improve performance, please share your results. Kudos to you for tackling a challenging optimization issue.
