@jlynlangford
This is a tricky situation, and a few approaches can be tried to close the performance gap.

Schema Complexity: If the DataFrame contains nested structs, arrays, or map types, collect() can become significantly slower because each row requires complex serialization. Flattening the schema before collecting can reduce this overhead and improve performance.
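A minimal sketch of flattening, assuming a hypothetical metastore table events with a nested address struct (table and column names are illustrative):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")

# Project nested struct fields into flat top-level columns so that
# collect() only has to serialize primitive types
flat <- sdf_sql(sc, "
  SELECT id,
         address.city AS address_city,
         address.zip  AS address_zip
  FROM events
")

local_df <- collect(flat)
```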
Data Partitioning: Before using collect(), inspect how the data is partitioned with sdf_num_partitions(). If there are too many partitions, or the partitions are badly skewed, repartition the DataFrame with sdf_repartition() so each partition carries a reasonable share of the rows.
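For example (the target count of 8 partitions is purely illustrative and should be tuned to your data size):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")
sdf <- tbl(sc, "events")  # hypothetical table

# Inspect the current partition count before collecting
sdf_num_partitions(sdf)

# Coalesce to a modest, even number of partitions if the count is
# very high or the partition sizes are skewed
sdf_even <- sdf_repartition(sdf, partitions = 8)

local_df <- collect(sdf_even)
```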
Switch Interface: As an effective workaround, use %sql to save the result as a Delta table, then read it back into R with a CSV or Parquet reader. This bypasses sparklyr's slower collection pipeline entirely.
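One way to sketch this, swapping in a plain Parquet table (rather than Delta) so that the arrow package can read the files directly; the table name, query, and DBFS path are all hypothetical:

```r
# In a separate notebook cell, materialize the result as plain Parquet
# files at an external location (CTAS):
#
#   %sql
#   DROP TABLE IF EXISTS result_export;
#   CREATE TABLE result_export
#   USING PARQUET
#   LOCATION '/tmp/result_export'
#   AS SELECT * FROM events;

# Then, in an R cell, read the Parquet files with arrow, skipping
# sparklyr's collect() serialization path.
# /dbfs is Databricks' FUSE mount of DBFS.
library(arrow)
library(dplyr)

local_df <- open_dataset("/dbfs/tmp/result_export") |>
  collect()
```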
If any of these steps improve performance, please share your results, and kudos to you for tackling a challenging optimization issue.