<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Iterating over a pyspark.pandas.groupby.DataFrameGroupBy in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/iterating-over-a-pyspark-pandas-groupby-dataframegroupby/m-p/63426#M32234</link>
    <description>&lt;P&gt;I have a pyspark.pandas.frame.DataFrame object (which I obtained by calling `pandas_api` on a pyspark.sql.dataframe.DataFrame object). I have a complicated transformation that I would like to apply to this data, and in particular I would like to apply it in blocks based on the value of a column 'C'.&lt;/P&gt;&lt;P&gt;If it had been a pandas.core.frame.DataFrame object, I could do:&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; for _, chunk in df.groupby("C"):&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; # do stuff&lt;/P&gt;&lt;P&gt;When I try this with a pyspark.pandas.frame.DataFrame object, I get `KeyError: (0,)`.&lt;/P&gt;&lt;P&gt;My question is: how do I get access to the grouped data in a pyspark.pandas.groupby.DataFrameGroupBy object? Is this possible at all, or am I only allowed to run aggregate functions?&lt;/P&gt;</description>
    <pubDate>Tue, 12 Mar 2024 19:05:31 GMT</pubDate>
    <dc:creator>JacobKesinger</dc:creator>
    <dc:date>2024-03-12T19:05:31Z</dc:date>
    <item>
      <title>Iterating over a pyspark.pandas.groupby.DataFrameGroupBy</title>
      <link>https://community.databricks.com/t5/data-engineering/iterating-over-a-pyspark-pandas-groupby-dataframegroupby/m-p/63426#M32234</link>
      <description>&lt;P&gt;I have a pyspark.pandas.frame.DataFrame object (which I obtained by calling `pandas_api` on a pyspark.sql.dataframe.DataFrame object). I have a complicated transformation that I would like to apply to this data, and in particular I would like to apply it in blocks based on the value of a column 'C'.&lt;/P&gt;&lt;P&gt;If it had been a pandas.core.frame.DataFrame object, I could do:&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; for _, chunk in df.groupby("C"):&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; # do stuff&lt;/P&gt;&lt;P&gt;When I try this with a pyspark.pandas.frame.DataFrame object, I get `KeyError: (0,)`.&lt;/P&gt;&lt;P&gt;My question is: how do I get access to the grouped data in a pyspark.pandas.groupby.DataFrameGroupBy object? Is this possible at all, or am I only allowed to run aggregate functions?&lt;/P&gt;</description>
      <pubDate>Tue, 12 Mar 2024 19:05:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/iterating-over-a-pyspark-pandas-groupby-dataframegroupby/m-p/63426#M32234</guid>
      <dc:creator>JacobKesinger</dc:creator>
      <dc:date>2024-03-12T19:05:31Z</dc:date>
    </item>
    <item>
      <title>Re: Iterating over a pyspark.pandas.groupby.DataFrameGroupBy</title>
      <link>https://community.databricks.com/t5/data-engineering/iterating-over-a-pyspark-pandas-groupby-dataframegroupby/m-p/63471#M32247</link>
      <description>&lt;P&gt;When you work with a pyspark.pandas.frame.DataFrame object and need to apply transformations to grouped data based on a specific column, you can use the groupby method followed by the apply function. This groups the data by the values of the specified column and applies custom transformation logic to each group.&lt;/P&gt;&lt;P&gt;Say you have a DataFrame named df and you want to group the data by the column 'C' and then apply a transformation to each group. The example below shows this pattern on a pyspark.sql DataFrame (a pandas-on-Spark DataFrame can be converted with to_spark()):&lt;/P&gt;&lt;LI-CODE lang="python"&gt;from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructType, StructField
import pandas as pd

# Create SparkSession
spark = SparkSession.builder.appName("GroupByTransform").getOrCreate()

# Sample Data
data = {"A": [1, 2, 3, 4, 5], "B": ["x", "y", "x", "z", "x"], "C": [1, 1, 2, 2, 1]}

# Define the schema for the original DataFrame
schema = StructType([
    StructField("A", IntegerType(), True),
    StructField("B", StringType(), True),
    StructField("C", IntegerType(), True)
])

# Create PySpark DataFrame with explicit schema
df = spark.createDataFrame(list(zip(data["A"], data["B"], data["C"])), schema=schema)

# Define the schema for the transformed DataFrame
new_schema = StructType([
    StructField("col1", IntegerType(), True),
    StructField("col2", StringType(), True)
])

# Define the transformation function; Spark passes each group in as a pandas DataFrame
def transform_block(data: pd.DataFrame) -> pd.DataFrame:
    # Transformation logic (replace with your logic)
    new_data = {
        "col1": data["A"] * 2,  # Example transformation on column 'A'
        "col2": data["B"]  # Column 'B' passed through unchanged
    }
    return data.assign(**new_data)[["col1", "col2"]]  # Return only the columns specified in new_schema

# Apply the transformation to each group of 'C' using applyInPandas
transformed_df = df.groupby("C").applyInPandas(transform_block, schema=new_schema)

# Show the result
transformed_df.show()&lt;/LI-CODE&gt;&lt;P&gt;This way you can work with the grouped data much as you would with a pandas DataFrame, which lets you perform complex transformations on each group.&lt;/P&gt;</description>
      <pubDate>Wed, 13 Mar 2024 08:29:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/iterating-over-a-pyspark-pandas-groupby-dataframegroupby/m-p/63471#M32247</guid>
      <dc:creator>MichTalebzadeh</dc:creator>
      <dc:date>2024-03-13T08:29:28Z</dc:date>
    </item>
    <item>
      <title>Re: Iterating over a pyspark.pandas.groupby.DataFrameGroupBy</title>
      <link>https://community.databricks.com/t5/data-engineering/iterating-over-a-pyspark-pandas-groupby-dataframegroupby/m-p/68927#M33772</link>
      <description>&lt;P&gt;Hi Mich,&lt;/P&gt;&lt;P&gt;I have a similar pandas_udf. The scripts failed to run on an all-purpose cluster. The error is: [UC_COMMAND_NOT_SUPPORTED.WITHOUT_RECOMMENDATION] The command(s): Spark higher-order functions are not supported in Unity Catalog. Do you know, by any chance, how to make it work with UC, or is it not supported by UC at all? I couldn't find the relevant documentation.&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Mon, 13 May 2024 21:58:20 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/iterating-over-a-pyspark-pandas-groupby-dataframegroupby/m-p/68927#M33772</guid>
      <dc:creator>MMGDGNDD</dc:creator>
      <dc:date>2024-05-13T21:58:20Z</dc:date>
    </item>
    <item>
      <title>Re: Iterating over a pyspark.pandas.groupby.DataFrameGroupBy</title>
      <link>https://community.databricks.com/t5/data-engineering/iterating-over-a-pyspark-pandas-groupby-dataframegroupby/m-p/69005#M33784</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;The error indicates that Unity Catalog does not support Spark higher-order functions, such as those used in pandas_udf. This limitation likely comes from architectural or compatibility constraints. To resolve the issue, consider alternative approaches or APIs that Unity Catalog supports for achieving similar functionality.&lt;/P&gt;</description>
      <pubDate>Tue, 14 May 2024 14:39:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/iterating-over-a-pyspark-pandas-groupby-dataframegroupby/m-p/69005#M33784</guid>
      <dc:creator>MichTalebzadeh</dc:creator>
      <dc:date>2024-05-14T14:39:39Z</dc:date>
    </item>
  </channel>
</rss>

