issue with group by

Braxx
Contributor II

I am trying to group a data frame by "PRODUCT" and "MARKET" and aggregate the remaining columns specified in col_list. There are many more columns in the list, but for simplicity let's take the example below.

Unfortunately I am getting the error:

"TypeError: unhashable type: 'Column'

on the line with exprs:

col_list = ["value", "units"]
 
exprs = {sum(x).alias(x) for x in col_list}
df2 = df1.groupBy("PRODUCT", "MARKET").agg(exprs)

TIA

1 ACCEPTED SOLUTION


Braxx
Contributor II

Yes, thanks, that's one thing. Another one was the missing "*".

Complete answer:

col_list = ["value", "units"]
 
exprs = [sum(x).alias(x) for x in col_list]
df2 = df1.groupBy("PRODUCT", "MARKET").agg(*exprs)

View solution in original post

6 REPLIES

Anonymous
Not applicable

I think you'll need to comma-separate each sum within the aggregate. I've never seen a list comprehension in the aggregate before.

.agg(sum(y).alias(y),
     sum(x).alias(x),
     .....)
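
A minimal runnable sketch of that approach, assuming df1 from the original post and Spark's aggregate functions imported as F:

from pyspark.sql import functions as F

# Each aggregation is passed as its own comma-separated argument to agg()
df2 = df1.groupBy("PRODUCT", "MARKET").agg(
    F.sum("value").alias("value"),
    F.sum("units").alias("units"),
)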

Reza
New Contributor III

You should replace "{" with "["

exprs = [sum(x).alias(x) for x in col_list]

Braxx
Contributor II

Yes, thanks, that's one thing. Another one was the missing "*".

Complete answer:

col_list = ["value", "units"]
 
exprs = [sum(x).alias(x) for x in col_list]
df2 = df1.groupBy("PRODUCT", "MARKET").agg(*exprs)
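
For reference, a minimal end-to-end sketch of this pattern, assuming an existing SparkSession named spark (as in a Databricks notebook) and made-up sample data standing in for the real df1:

from pyspark.sql import functions as F

# Hypothetical sample data with the columns from the original post
df1 = spark.createDataFrame(
    [("A", "US", 10.0, 2), ("A", "US", 5.0, 1), ("B", "EU", 7.0, 3)],
    ["PRODUCT", "MARKET", "value", "units"],
)

col_list = ["value", "units"]

# One aliased sum per column in col_list, unpacked into agg() with *
exprs = [F.sum(x).alias(x) for x in col_list]
df2 = df1.groupBy("PRODUCT", "MARKET").agg(*exprs)
df2.show()  # one row per PRODUCT/MARKET combination with summed value and units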

Shivers
New Contributor II

The result should be a table with two columns: one is Year (2008, 2009), the other is Annual Income ($2500, $2000). But it didn't work unless I grouped by both Year and Income, which gives a different result from what I want with grouping by Year only.


Pholo
Contributor

Hi @Shivers Robert

Try something like this:

import pyspark.sql.functions as F

# Keep column_sum only for rows matching the given year, otherwise null
def year_sum(year, column_year, column_sum):
  return F.when(
    F.col(column_year) == year, F.col(column_sum)
  ).otherwise(F.lit(None))

# One aliased sum per year, producing one column per year
display(df.select(*[F.sum(year_sum(i, 'year', 'your_column_variable')).alias(str(i)) for i in [2018, 2019]]))
#### OR you can use the pivot method
display(df.groupby(F.lit('fake')).pivot('year').agg(F.sum('your_column_variable')).drop('fake'))

Let me know if it works.
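
To make the pivot suggestion concrete for the Year / Annual Income case above, here is a small sketch assuming a toy DataFrame with hypothetical Year and Income columns and an existing SparkSession named spark:

import pyspark.sql.functions as F

# Hypothetical data matching the example: yearly incomes summing to 2500 and 2000
df = spark.createDataFrame(
    [(2008, 1500), (2008, 1000), (2009, 2000)],
    ["Year", "Income"],
)

# Pivot on Year so each year becomes a column holding the summed Income
display(df.groupby(F.lit("fake")).pivot("Year").agg(F.sum("Income")).drop("fake"))
# -> columns 2008 and 2009 with values 2500 and 2000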

Ralphma
New Contributor II

The error you're encountering, "TypeError: unhashable type: 'Column'," is likely due to the way you're defining exprs. In Python, sets use curly braces {}, but they require their items to be hashable. Since the result of sum(x).alias(x) is not hashable (it's a Column object), you cannot use a set here.
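
As a quick illustration of that point, a minimal sketch that reproduces the error, assuming sum here is Spark's pyspark.sql.functions.sum as in the original code:

from pyspark.sql import functions as F

# Sets hash their elements as they are built; Column objects are unhashable,
# so this set comprehension raises TypeError: unhashable type: 'Column'
exprs = {F.sum(x).alias(x) for x in ["value", "units"]}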

Instead, you can use a dictionary comprehension to build a mapping from column names to aggregate function names if you want to use the .agg() method with multiple aggregations. Here's how you can modify your code:

from pyspark.sql import functions as F

col_list = ["value", "units"]
 
# Map each column name to the aggregate function to apply
exprs = {x: "sum" for x in col_list}
df2 = df1.groupBy("PRODUCT", "MARKET").agg(exprs)
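
One caveat worth adding here (my note, not from the original reply): the dictionary form names its output columns after the function, e.g. sum(value). A small sketch of renaming them back, if you want the original names:

# Dictionary-style agg yields columns named "sum(value)", "sum(units)", etc.,
# so rename them back to the plain column names if desired
for x in col_list:
    df2 = df2.withColumnRenamed(f"sum({x})", x)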

However, if you want to use custom aliases for your aggregated columns, you would need to pass the aggregate functions as separate arguments to the .agg() method like this:

from pyspark.sql import functions as F

col_list = ["value", "units"]

# Create a list of column expressions for aggregation
exprs = [F.sum(x).alias(x) for x in col_list]
df2 = df1.groupBy("PRODUCT", "MARKET").agg(*exprs)

In this code snippet, exprs is a list of Column objects created by the list comprehension. The *exprs syntax unpacks the list into separate arguments for the .agg() method.

Note that in the second example you need to import the functions module from pyspark.sql as F to be able to use F.sum and other aggregate functions; the dictionary form refers to the aggregate only by its string name.

Remember to adjust the imports and function calls according to the actual data processing library you are using, as the code provided assumes you are using PySpark. If you are using a different library (like pandas), the syntax for aggregation will be different.
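
For instance, a rough pandas equivalent of the same grouping might look like this (a sketch only, assuming df1 is a pandas DataFrame with the same columns):

import pandas as pd

# groupby + sum over the selected columns, keeping PRODUCT and MARKET as columns
df2 = df1.groupby(["PRODUCT", "MARKET"], as_index=False)[["value", "units"]].sum()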
