Data Engineering

issue with group by

Braxx
Contributor II

I am trying to group a data frame by "PRODUCT" and "MARKET" and aggregate the remaining columns specified in col_list. There are many more columns in the list, but for simplicity let's take the example below.

Unfortunately I am getting the error:

"TypeError: unhashable type: 'Column'"

on the line with exprs:

col_list = ["value", "units"]
 
exprs = {sum(x).alias(x) for x in col_list}  # this set comprehension is the line that raises the TypeError
df2 = df1.groupBy("PRODUCT", "MARKET").agg(exprs)

TIA


6 REPLIES

Anonymous
Not applicable

I think you'll need to comma-separate each sum within the aggregate. I've never seen a list comprehension in the aggregate before.

.agg(sum(y).alias(y),
     sum(x).alias(x),
     ...)
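
For the columns in this thread, that explicit form would look like this (a minimal sketch, assuming df1 has the "value" and "units" columns from the question):

from pyspark.sql import functions as F
 
df2 = df1.groupBy("PRODUCT", "MARKET").agg(
    F.sum("value").alias("value"),
    F.sum("units").alias("units"),
)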

Reza
New Contributor III

You should replace the curly braces "{" with square brackets "[" so that exprs is a list rather than a set:

exprs = [sum(x).alias(x) for x in col_list]

Braxx
Contributor II

Ya, thanks, that's one thing. The other was a missing "*".

Complete answer:

from pyspark.sql.functions import sum  # Spark's sum, not Python's built-in
 
col_list = ["value", "units"]
 
exprs = [sum(x).alias(x) for x in col_list]
df2 = df1.groupBy("PRODUCT", "MARKET").agg(*exprs)
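
As a quick end-to-end check, here is the complete answer run on a tiny made-up DataFrame (the sample data and SparkSession setup are my assumptions, not from the thread):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
 
spark = SparkSession.builder.getOrCreate()
 
# Toy data: two grouping columns plus the two columns to aggregate
df1 = spark.createDataFrame(
    [("A", "US", 10, 1), ("A", "US", 20, 2), ("B", "EU", 5, 3)],
    ["PRODUCT", "MARKET", "value", "units"],
)
 
col_list = ["value", "units"]
exprs = [F.sum(x).alias(x) for x in col_list]
df1.groupBy("PRODUCT", "MARKET").agg(*exprs).show()
# Expected rows (order may vary): ("A", "US", 30, 3) and ("B", "EU", 5, 3)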

Shivers
New Contributor II

The result should be a table with two columns: one is Year (2008, 2009), the other is Annual Income ($2500, $2000). But it didn't work unless I grouped by both Year and Income, which gives a different result from what I want with grouping by Year only.


Pholo
Contributor

Hi @Shivers Robert

Try something like this:

import pyspark.sql.functions as F
 
def year_sum(year, column_year, column_sum):
    # Keep column_sum's value only for rows matching the given year
    return F.when(
        F.col(column_year) == year, F.col(column_sum)
    ).otherwise(F.lit(None))
 
# One output column per year, each summing only that year's values
display(df.select(*[F.sum(year_sum(i, 'year', 'your_column_variable')).alias(str(i)) for i in [2018, 2019]]))
 
# Or use the pivot method
display(df.groupby(F.lit('fake')).pivot('year').agg(F.sum('your_column_variable')).drop('fake'))

Let me know if it works.
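
For example, on a toy income table (the sample data is my assumption, for illustration only), the pivot version yields one column per year:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession
 
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(2018, 1000), (2018, 1500), (2019, 2000)],
    ["year", "your_column_variable"],
)
 
# One row, with a column per year holding that year's summed value
df.groupby(F.lit("fake")).pivot("year").agg(F.sum("your_column_variable")).drop("fake").show()
# +----+----+
# |2018|2019|
# +----+----+
# |2500|2000|
# +----+----+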

Ralphma
New Contributor II

The error you're encountering, "TypeError: unhashable type: 'Column'," is likely due to the way you're defining exprs. In Python, sets use curly braces {}, but they require their items to be hashable. Since the result of sum(x).alias(x) is not hashable (it's a Column object), you cannot use a set here.

Instead, you can pass .agg() a dictionary that maps each column name to the name of an aggregate function, built here with a dictionary comprehension. Here's how you can modify your code:

col_list = ["value", "units"]
 
# Map each column name to the aggregate function to apply
exprs = {x: "sum" for x in col_list}
df2 = df1.groupBy("PRODUCT", "MARKET").agg(exprs)
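
One caveat worth adding here (my note, not from the original reply): with the dictionary form, Spark names the output columns after the aggregate, e.g. sum(value), so you may want to rename them afterwards:

# The dict form produces columns named "sum(value)" and "sum(units)"
df2 = (df1.groupBy("PRODUCT", "MARKET")
          .agg({x: "sum" for x in col_list})
          .withColumnRenamed("sum(value)", "value")
          .withColumnRenamed("sum(units)", "units"))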

However, if you want to use custom aliases for your aggregated columns, you would need to pass the aggregate functions as separate arguments to the .agg() method like this:

from pyspark.sql import functions as F

col_list = ["value", "units"]

# Create a list of column expressions for aggregation
exprs = [F.sum(x).alias(x) for x in col_list]
df2 = df1.groupBy("PRODUCT", "MARKET").agg(*exprs)

In this code snippet, exprs is a list of Column objects created by the list comprehension. The *exprs syntax unpacks the list into separate arguments for the .agg() method.

Note that the second example imports the functions module from pyspark.sql as F so that F.sum refers to Spark's aggregate function rather than Python's built-in sum. The dictionary form in the first example needs no import, since the aggregate is named by the string "sum".

Remember to adjust the imports and function calls according to the actual data processing library you are using, as the code provided assumes you are using PySpark. If you are using a different library (like pandas), the syntax for aggregation will be different.
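
For comparison, a rough pandas equivalent of the same aggregation could look like this (a sketch, assuming df1 is a pandas DataFrame with the same columns):

# Group by the two key columns and sum the rest; as_index=False keeps the keys as columns
df2 = df1.groupby(["PRODUCT", "MARKET"], as_index=False)[["value", "units"]].sum()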
