01-05-2022 02:58 AM
I am trying to group a data frame by "PRODUCT" and "MARKET" and aggregate the remaining columns specified in col_list. There are many more columns in the real list, but for simplicity let's take the example below.
Unfortunately I am getting the error
"TypeError: unhashable type: 'Column'"
on the line with exprs:
col_list = ["value", "units"]
exprs = {sum(x).alias(x) for x in col_list}
df2 = df1.groupBy("PRODUCT", "MARKET").agg(exprs)
TIA
01-10-2022 04:11 AM
Yes, thanks, that's one thing. The other was a missing "*".
Complete answer:
from pyspark.sql.functions import sum

col_list = ["value", "units"]
exprs = [sum(x).alias(x) for x in col_list]
df2 = df1.groupBy("PRODUCT", "MARKET").agg(*exprs)
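The "*" matters because .agg() expects each aggregate expression as a separate positional argument, not one list. A toy sketch in plain Python (the agg function here is a hypothetical stand-in, not the PySpark API) shows what unpacking changes:

```python
# Hypothetical stand-in for DataFrame.agg: it accepts each
# aggregate expression as a separate positional argument.
def agg(*exprs):
    return list(exprs)

col_list = ["value", "units"]
# One expression string per column, mirroring [sum(x).alias(x) for x in col_list]
exprs = [f"sum({c}) AS {c}" for c in col_list]

agg(*exprs)  # unpacked: two separate arguments
agg(exprs)   # not unpacked: a single argument holding the whole list
```

Without the "*", agg receives the list itself as one argument, which is why the call misbehaves even after the set-to-list fix.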
01-05-2022 04:38 AM
I think you'll need to comma-separate each sum within the aggregate. I've never seen a list comprehension in the aggregate before.
.agg(sum(y).alias(y),
sum(x).alias(x),
.....)
01-06-2022 12:14 PM
You should replace "{" with "[": a set comprehension fails because Column objects are unhashable, so use a list comprehension instead.
exprs = [sum(x).alias(x) for x in col_list]
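For context on the original TypeError: a set needs hashable elements, and PySpark's Column overrides __eq__ to build expressions rather than return booleans, which leaves it unhashable. A minimal plain-Python sketch (FakeColumn is a hypothetical stand-in, not the real class) reproduces the error:

```python
# Objects that define __eq__ without a usable __hash__ are unhashable,
# which is the situation with PySpark Column objects.
class FakeColumn:
    def __eq__(self, other):
        return "=="        # builds an "expression", not a bool
    __hash__ = None        # an __eq__ like this disables hashing

try:
    {FakeColumn() for _ in range(2)}   # set comprehension: TypeError
except TypeError as e:
    print(e)                           # unhashable type: 'FakeColumn'

[FakeColumn() for _ in range(2)]       # list comprehension works fine
```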
01-10-2022 12:23 AM
Hi @Shivers Robert
Try something like this:
import pyspark.sql.functions as F
def year_sum(year, column_year, column_sum):
    return F.when(
        F.col(column_year) == year, F.col(column_sum)
    ).otherwise(F.lit(None))
display(df.select(*[F.sum(year_sum(i, 'year', 'your_column_variable')).alias(str(i)) for i in [2018, 2019]]))
#### OR you can use the pivot method
display(df.groupby(F.lit('fake')).pivot('year').agg(F.sum('your_column_variable')).drop('fake'))
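If it helps to see the pivot idea outside Spark, here is a rough pandas analogue of the same fake-group trick (assuming pandas is available; the sample data and column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2018, 2018, 2019],
    "value": [10.0, 5.0, 7.0],
})

# Add a constant "fake" group, pivot years into columns, sum the values.
wide = df.assign(fake=0).pivot_table(
    index="fake", columns="year", values="value", aggfunc="sum"
)
# wide.loc[0, 2018] -> 15.0, wide.loc[0, 2019] -> 7.0
```

The constant group plays the same role as F.lit('fake') above: it gives the pivot a single row to aggregate into before the helper column is dropped.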
Let me know if it works.