Greetings @Paddy_chu,
You’re right that sparklyr gives you most dplyr verbs on Spark, but many tidyr verbs (including pivot_wider/pivot_longer) aren’t translated to Spark SQL and thus won’t run lazily on Spark tables. The practical options are to stay in Spark using SQL/conditional aggregation, or collect to R when the result is small.
What works well today
- sparklyr + dbplyr is the recommended stack for tidy-style work on Spark. You can use most dplyr verbs (select, mutate, filter, group_by, summarise, joins, arrange) directly on Spark-backed tables; a short sketch follows this list.
- Conditional aggregation in dplyr to “pivot” without leaving Spark. This translates to CASE WHEN + aggregates in Spark SQL and scales well. For example, given columns id, category, value:

```r
library(dplyr)

df %>%
  group_by(id) %>%
  summarise(
    A = sum(if_else(category == "A", value, 0.0)),
    B = sum(if_else(category == "B", value, 0.0)),
    C = sum(if_else(category == "C", value, 0.0)),
    .groups = "drop"
  )
```
- Spark SQL PIVOT via sparklyr/dbplyr for wide reshapes on large data. Register the source as a temp view, then issue a PIVOT:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Register the Spark DataFrame as a temp view named "df"
sdf_register(df, "df")

wide <- tbl(sc, sql("
  SELECT * FROM (
    SELECT id, category, value FROM df
  )
  PIVOT (
    SUM(value) FOR category IN ('A', 'B', 'C')
  )
"))
```

This keeps the computation distributed and fast.
- Collect then tidyr for small outputs. If the pivoted result will be modest in size, collect and then use pivot_wider:

```r
library(tidyr)

df_small <- df %>%
  collect() %>%
  pivot_wider(
    names_from = category,
    values_from = value,
    values_fn = sum
  )
```
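As a minimal sketch of the lazy dplyr workflow (this uses a local connection and copies mtcars to Spark purely for illustration; names like mtcars_sdf are hypothetical), show_query() confirms the CASE WHEN translation mentioned above:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copy a small demo data frame to Spark (illustration only)
mtcars_sdf <- copy_to(sc, mtcars, "mtcars_demo", overwrite = TRUE)

# Standard dplyr verbs run lazily against the Spark table;
# show_query() prints the generated Spark SQL (SUM(CASE WHEN ...))
mtcars_sdf %>%
  group_by(cyl) %>%
  summarise(
    manual_mpg = sum(if_else(am == 1, mpg, 0.0)),
    auto_mpg   = sum(if_else(am == 0, mpg, 0.0))
  ) %>%
  show_query()
```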
Handling unknown/dynamic column values
- In Spark, the SQL PIVOT requires enumerating the target column values. You can pull them from Spark and programmatically build the summarise call or the SQL string:

```r
library(rlang)

# Discover the distinct categories at runtime
cats <- df %>% distinct(category) %>% collect() %>% pull(category)

# Build one conditional-sum expression per category
summaries <- setNames(
  lapply(cats, function(cat) expr(sum(if_else(category == !!cat, value, 0.0)))),
  cats
)

df %>%
  group_by(id) %>%
  summarise(!!!summaries, .groups = "drop")
```

This stays in Spark and yields a wide table even when categories are discovered at runtime.
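If you prefer the SQL route, here is a hedged sketch that splices the discovered categories into a PIVOT string (reusing cats, sc, and the "df" temp view from above; the quoting is naive, so sanitize values that could contain quotes):

```r
# Build the IN (...) list from the categories discovered at runtime
in_list <- paste0("'", cats, "'", collapse = ", ")

query <- paste0(
  "SELECT * FROM (SELECT id, category, value FROM df) ",
  "PIVOT (SUM(value) FOR category IN (", in_list, "))"
)

wide <- tbl(sc, sql(query))  # still lazy and distributed
```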
Are there packages that make all tidyverse verbs work on Spark?
- There isn’t a single R package that makes “most of the tidyverse” fully available and lazy on Spark. The best-supported path is sparklyr + dbplyr for the dplyr grammar; for tidyr-like reshaping you either:
  * Express it in Spark SQL (PIVOT) or conditional aggregation, or
  * Collect to R and use tidyr when the result is small enough.
- Related helpers you may find useful:
- sparklyr.nested for working with nested/struct/array columns in Spark using tidy-like helpers (a short sketch follows this list).
- dbplyr SQL translation features continue to improve, but large-scale tidyr reshaping remains best handled by Spark SQL or conditional aggregation.
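For a flavor of sparklyr.nested, a hedged sketch: it assumes a Spark tbl events_sdf with an array column ids and a struct column meta containing a field city (all hypothetical names), and uses the sdf_explode()/sdf_select() helpers as documented by that package:

```r
library(sparklyr.nested)

# Hypothetical schema: events_sdf has an array column `ids`
# and a struct column `meta` with a nested field `city`
events_sdf %>%
  sdf_explode(ids) %>%                         # one row per array element
  sdf_select(event_id, ids, city = meta.city)  # reach into the struct
```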
Practical tips
- Keep operations lazy and in Spark as long as possible; only collect() after major reductions (see the sketch after this list).
- If you know the set of pivot keys, conditional aggregation via dplyr is often the simplest and most portable approach on Spark.
- For extremely wide pivots, prefer Spark SQL PIVOT to avoid generating huge CASE WHEN expressions.
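To make the first tip concrete, a small sketch of the reduce-then-collect pattern (same df as in the examples above):

```r
# The heavy lifting happens in Spark; only the small aggregated
# result crosses the wire into the R session
result <- df %>%
  filter(!is.na(value)) %>%
  group_by(id, category) %>%
  summarise(total = sum(value), .groups = "drop") %>%
  collect()
```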
Cheers, Louis.