<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How to transpose spark dataframe using R API? in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/how-to-transpose-spark-dataframe-using-r-api/m-p/111640#M3979</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I recently discovered the sparklyr package and found it quite useful. After setting up the Spark connection, I can apply dplyr functions to manipulate large tables. However, it seems that any functions outside of dplyr cannot be used on Spark views. For instance, after manipulating the data, I wanted to transpose it using pivot_wider, but this function is only available in the tidyr package. Does anyone know of any R packages that would allow me to use most of the tidyverse syntax on Spark dataframes to process large dataframes quickly?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
    <pubDate>Mon, 03 Mar 2025 23:38:12 GMT</pubDate>
    <dc:creator>Paddy_chu</dc:creator>
    <dc:date>2025-03-03T23:38:12Z</dc:date>
    <item>
      <title>How to transpose spark dataframe using R API?</title>
      <link>https://community.databricks.com/t5/machine-learning/how-to-transpose-spark-dataframe-using-r-api/m-p/111640#M3979</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;I recently discovered the sparklyr package and found it quite useful. After setting up the Spark connection, I can apply dplyr functions to manipulate large tables. However, it seems that any functions outside of dplyr cannot be used on Spark views. For instance, after manipulating the data, I wanted to transpose it using pivot_wider, but this function is only available in the tidyr package. Does anyone know of any R packages that would allow me to use most of the tidyverse syntax on Spark dataframes to process large dataframes quickly?&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Mon, 03 Mar 2025 23:38:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-to-transpose-spark-dataframe-using-r-api/m-p/111640#M3979</guid>
      <dc:creator>Paddy_chu</dc:creator>
      <dc:date>2025-03-03T23:38:12Z</dc:date>
    </item>
    <item>
      <title>Re: How to transpose spark dataframe using R API?</title>
      <link>https://community.databricks.com/t5/machine-learning/how-to-transpose-spark-dataframe-using-r-api/m-p/136390#M4376</link>
      <description>&lt;P&gt;Greetings&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/68565"&gt;@Paddy_chu&lt;/a&gt;&amp;nbsp;,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;DIV class="paragraph"&gt;You’re right that &lt;STRONG&gt;sparklyr&lt;/STRONG&gt; gives you most &lt;STRONG&gt;dplyr&lt;/STRONG&gt; verbs on Spark, but many &lt;STRONG&gt;tidyr&lt;/STRONG&gt; verbs (including &lt;STRONG&gt;pivot_wider&lt;/STRONG&gt;/&lt;STRONG&gt;pivot_longer&lt;/STRONG&gt;) aren’t translated to Spark SQL and thus won’t run lazily on Spark tables. The practical options are to stay in Spark using SQL/conditional aggregation, or collect to R when the result is small.&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;H3 class="paragraph"&gt;What works well today&lt;/H3&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;&lt;STRONG&gt;sparklyr + dbplyr&lt;/STRONG&gt; is the recommended stack for tidy-style work on Spark. You can use most dplyr verbs (select, mutate, filter, group_by, summarise, joins, arrange) directly on Spark-backed tables.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Conditional aggregation in dplyr&lt;/STRONG&gt; to “pivot” without leaving Spark. This translates to CASE WHEN + aggregates in Spark SQL and scales well. For example, given columns id, category, and value:&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="paragraph"&gt;&lt;PRE&gt;library(dplyr)

df %&amp;gt;%
  group_by(id) %&amp;gt;%
  summarise(
    A = sum(if_else(category == "A", value, 0.0)),
    B = sum(if_else(category == "B", value, 0.0)),
    C = sum(if_else(category == "C", value, 0.0)),
    .groups = "drop"
  )&lt;/PRE&gt;&lt;/DIV&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Spark SQL PIVOT&lt;/STRONG&gt; via sparklyr/dbplyr for wide reshapes on large data. Register the source as a temp view, then issue a PIVOT:&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="paragraph"&gt;&lt;PRE&gt;library(sparklyr)
library(dplyr)

sc &amp;lt;- spark_connect(master = "local")
sdf_register(df, "df")

wide &amp;lt;- tbl(sc, sql("
  SELECT * FROM (
    SELECT id, category, value FROM df
  )
  PIVOT (
    SUM(value) FOR category IN ('A', 'B', 'C')
  )
"))&lt;/PRE&gt;&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;This keeps the computation distributed and fast.&lt;/DIV&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Collect then tidyr&lt;/STRONG&gt; for small outputs. If the pivoted result will be modest in size, collect first and then use pivot_wider locally:&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="paragraph"&gt;&lt;PRE&gt;library(dplyr)
library(tidyr)

df_small &amp;lt;- df %&amp;gt;%
  collect() %&amp;gt;%
  pivot_wider(
    names_from = category,
    values_from = value,
    values_fn = sum
  )&lt;/PRE&gt;&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;H3 class="paragraph"&gt;Handling unknown/dynamic column values&lt;/H3&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;In Spark, the SQL PIVOT requires enumerating the target column values. You can pull them from Spark and programmatically build the summarise or SQL string: ```r cats &amp;lt;- df %&amp;gt;% distinct(category) %&amp;gt;% collect() %&amp;gt;% pull(category)&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="paragraph"&gt;library(rlang) summaries &amp;lt;- setNames( lapply(cats, function(cat) expr(sum(if_else(category == !!cat, value, 0.0)))), cats )&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;df %&amp;gt;% group_by(id) %&amp;gt;% summarise(!!!summaries, .groups = "drop") ``` This stays in Spark and yields a wide table even when categories are discovered at runtime.&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;H3 class="paragraph"&gt;Are there packages that make all tidyverse verbs work on Spark?&lt;/H3&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;There isn’t a single R package that makes “most of the tidyverse” fully available and lazy on Spark. The best-supported path is &lt;STRONG&gt;sparklyr + dbplyr&lt;/STRONG&gt; for dplyr grammar, and then for tidyr-like reshaping you either: * Express it in Spark SQL (PIVOT) or conditional aggregation, or * Collect to R and use tidyr when the result is small enough.&lt;/LI&gt;
&lt;LI&gt;Related helpers you may find useful:
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;sparklyr.nested&lt;/STRONG&gt; for working with nested/struct/array columns in Spark using tidy-like helpers.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;dbplyr SQL translation&lt;/STRONG&gt; features continue to improve, but large-scale tidyr reshaping remains best handled by Spark SQL or conditional aggregation.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
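&lt;DIV class="paragraph"&gt;For completeness, a brief sketch of the sparklyr.nested helpers mentioned above (function names follow that package’s documentation; the iris table is illustrative, and sparklyr renames its columns to use underscores on copy):&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;&lt;PRE&gt;library(sparklyr)
library(sparklyr.nested)

sc  &amp;lt;- spark_connect(master = "local")
sdf &amp;lt;- copy_to(sc, iris, "iris_tbl", overwrite = TRUE)

# Pack the sepal measurements into a single struct column
nested &amp;lt;- sdf_nest(sdf, Sepal_Length, Sepal_Width, .key = "sepal")

# Tidy-style selection that reaches into the nested struct
flat &amp;lt;- sdf_select(nested, Species, Sepal_Length = sepal.Sepal_Length)&lt;/PRE&gt;&lt;/DIV&gt;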
&lt;H3 class="paragraph"&gt;Practical tips&lt;/H3&gt;
&lt;UL&gt;
&lt;LI class="paragraph"&gt;Keep operations lazy and in Spark as long as possible; only &lt;CODE&gt;collect()&lt;/CODE&gt; after major reductions.&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;If you know the set of pivot keys, conditional aggregation via dplyr is often the simplest and most portable approach on Spark.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;DIV class="paragraph"&gt;For extremely wide pivots, prefer Spark SQL PIVOT to avoid generating huge CASE WHEN expressions.&lt;/DIV&gt;
&lt;/LI&gt;
&lt;/UL&gt;
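&lt;DIV class="paragraph"&gt;The first tip can be made concrete with a size check before collecting (&lt;CODE&gt;sdf_nrow&lt;/CODE&gt; is a sparklyr helper; the row threshold here is illustrative):&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;&lt;PRE&gt;library(sparklyr)
library(dplyr)

result &amp;lt;- df %&amp;gt;%
  group_by(id) %&amp;gt;%
  summarise(total = sum(value, na.rm = TRUE))  # still lazy, runs in Spark

# Only collect once the reduced result is known to be small
if (sdf_nrow(result) &amp;lt; 1e6) {
  local_df &amp;lt;- collect(result)
}&lt;/PRE&gt;&lt;/DIV&gt;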
&lt;DIV class="paragraph"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="paragraph"&gt;Cheers, Louis.&lt;/DIV&gt;</description>
      <pubDate>Tue, 28 Oct 2025 12:18:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/how-to-transpose-spark-dataframe-using-r-api/m-p/136390#M4376</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-10-28T12:18:54Z</dc:date>
    </item>
  </channel>
</rss>

