Greetings @Paddy_chu,
You’re right that sparklyr gives you most dplyr verbs on Spark, but many tidyr verbs (including pivot_wider/pivot_longer) aren’t translated to Spark SQL and thus won’t run lazily on Spark tables. The practical options are to stay in Spark using SQL/conditional aggregation, or collect to R when the result is small.
What works well today
- sparklyr + dbplyr is the recommended stack for tidy-style work on Spark. You can use most dplyr verbs (select, mutate, filter, group_by, summarise, joins, arrange) directly on Spark-backed tables; a short sketch follows this list.
- Conditional aggregation in dplyr to “pivot” without leaving Spark. This translates to CASE WHEN + aggregates in Spark SQL and scales well. For example, given columns id, category, value:

```r
library(dplyr)

df %>%
  group_by(id) %>%
  summarise(
    A = sum(if_else(category == "A", value, 0.0)),
    B = sum(if_else(category == "B", value, 0.0)),
    C = sum(if_else(category == "C", value, 0.0)),
    .groups = "drop"
  )
```
- Spark SQL PIVOT via sparklyr/dbplyr for wide reshapes on large data. Register the source as a temp view, then issue a PIVOT:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Register the Spark DataFrame as a temp view named "df"
sdf_register(df, "df")

wide <- tbl(sc, sql("
  SELECT * FROM (
    SELECT id, category, value FROM df
  )
  PIVOT (
    SUM(value) FOR category IN ('A', 'B', 'C')
  )
"))
```

This keeps the computation distributed and fast.
- Collect then tidyr for small outputs. If the pivoted result will be modest in size, collect and then use pivot_wider:

```r
library(tidyr)

df_small <- df %>%
  collect() %>%
  pivot_wider(
    names_from = category,
    values_from = value,
    values_fn = sum
  )
```
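As a minimal sketch of the lazy dplyr workflow (this uses a local connection and copies mtcars to Spark purely for illustration; names like mtcars_sdf are hypothetical), show_query() confirms the CASE WHEN translation mentioned above:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copy a small demo data frame to Spark (illustration only)
mtcars_sdf <- copy_to(sc, mtcars, "mtcars_demo", overwrite = TRUE)

# Standard dplyr verbs run lazily against the Spark table;
# show_query() prints the generated Spark SQL (SUM(CASE WHEN ...))
mtcars_sdf %>%
  group_by(cyl) %>%
  summarise(
    manual_mpg = sum(if_else(am == 1, mpg, 0.0)),
    auto_mpg   = sum(if_else(am == 0, mpg, 0.0))
  ) %>%
  show_query()
```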
Handling unknown/dynamic column values
- In Spark, the SQL PIVOT requires enumerating the target column values. You can pull them from Spark and programmatically build the summarise call or the SQL string:

```r
library(rlang)

# Discover the distinct categories at runtime
cats <- df %>% distinct(category) %>% collect() %>% pull(category)

# Build one conditional-sum expression per category
summaries <- setNames(
  lapply(cats, function(cat) expr(sum(if_else(category == !!cat, value, 0.0)))),
  cats
)

df %>%
  group_by(id) %>%
  summarise(!!!summaries, .groups = "drop")
```

This stays in Spark and yields a wide table even when categories are discovered at runtime.
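If you prefer the SQL route, here is a hedged sketch that splices the discovered categories into a PIVOT string (reusing cats, sc, and the "df" temp view from above; the quoting is naive, so sanitize values that could contain quotes):

```r
# Build the IN (...) list from the categories discovered at runtime
in_list <- paste0("'", cats, "'", collapse = ", ")

query <- paste0(
  "SELECT * FROM (SELECT id, category, value FROM df) ",
  "PIVOT (SUM(value) FOR category IN (", in_list, "))"
)

wide <- tbl(sc, sql(query))  # still lazy and distributed
```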
Are there packages that make all tidyverse verbs work on Spark?
- There isn’t a single R package that makes “most of the tidyverse” fully available and lazy on Spark. The best-supported path is sparklyr + dbplyr for the dplyr grammar; for tidyr-like reshaping you either:
  * Express it in Spark SQL (PIVOT) or conditional aggregation, or
  * Collect to R and use tidyr when the result is small enough.
- Related helpers you may find useful:
- sparklyr.nested for working with nested/struct/array columns in Spark using tidy-like helpers (a short sketch follows this list).
- dbplyr SQL translation features continue to improve, but large-scale tidyr reshaping remains best handled by Spark SQL or conditional aggregation.
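For a flavor of sparklyr.nested, a hedged sketch: it assumes a Spark tbl events_sdf with an array column ids and a struct column meta containing a field city (all hypothetical names), and uses the sdf_explode()/sdf_select() helpers as documented by that package:

```r
library(sparklyr.nested)

# Hypothetical schema: events_sdf has an array column `ids`
# and a struct column `meta` with a nested field `city`
events_sdf %>%
  sdf_explode(ids) %>%                         # one row per array element
  sdf_select(event_id, ids, city = meta.city)  # reach into the struct
```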
Practical tips
- Keep operations lazy and in Spark as long as possible; only collect() after major reductions (see the sketch after this list).
- If you know the set of pivot keys, conditional aggregation via dplyr is often the simplest and most portable approach on Spark.
- For extremely wide pivots, prefer Spark SQL PIVOT to avoid generating huge CASE WHEN expressions.
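To make the first tip concrete, a small sketch of the reduce-then-collect pattern (same df as in the examples above):

```r
# The heavy lifting happens in Spark; only the small aggregated
# result crosses the wire into the R session
result <- df %>%
  filter(!is.na(value)) %>%
  group_by(id, category) %>%
  summarise(total = sum(value), .groups = "drop") %>%
  collect()
```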
Cheers, Louis.