01-05-2022 02:58 AM
I am trying to group a data frame by "PRODUCT" and "MARKET" and aggregate the remaining columns specified in col_list. There are many more columns in the list, but for simplicity let's take the example below.
Unfortunately I am getting the error:
"TypeError: unhashable type: 'Column'"
on the line with exprs:
col_list = ["value", "units"]
exprs = {sum(x).alias(x) for x in col_list}
df2 = df1.groupBy("PRODUCT", "MARKET").agg(exprs)
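For reference, a minimal df1 that reproduces this (column names from above, values made up) can be built like:

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum  # the snippet above assumes sum is pyspark.sql.functions.sum

spark = SparkSession.builder.getOrCreate()

# Toy rows with the columns mentioned above (values are invented)
df1 = spark.createDataFrame(
    [("A", "US", 1.0, 10), ("A", "US", 2.0, 20), ("B", "EU", 3.0, 30)],
    ["PRODUCT", "MARKET", "value", "units"],
)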
TIA
01-10-2022 04:11 AM
Yes, thanks. That's one thing. The other issue was a missing "*".
Complete answer:
from pyspark.sql.functions import sum

col_list = ["value", "units"]
exprs = [sum(x).alias(x) for x in col_list]
df2 = df1.groupBy("PRODUCT", "MARKET").agg(*exprs)
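With a df1 like the toy one sketched above, df2 should come back with one row per (PRODUCT, MARKET) pair plus the summed value and units columns.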
01-05-2022 04:38 AM
I think you'll need to comma-separate each sum within the aggregate. I've never seen a list comprehension in the aggregate before.
.agg(sum(y).alias(y),
sum(x).alias(x),
.....)
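Written out for the two columns in the question, that would look roughly like this (a sketch, assuming the aggregate functions are imported as F):

from pyspark.sql import functions as F

df2 = df1.groupBy("PRODUCT", "MARKET").agg(
    F.sum("value").alias("value"),
    F.sum("units").alias("units"),
)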
01-06-2022 12:14 PM
You should replace "{" with "["
exprs = [sum(x).alias(x) for x in col_list]
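A small sketch of why the braces matter (assuming an active Spark session): a set comprehension has to hash every element, and PySpark Column objects are not hashable, so building the set is what raises the error; a list comprehension just collects the columns.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Set comprehension: hashes each Column -> TypeError: unhashable type: 'Column'
# {F.sum(x).alias(x) for x in ["value", "units"]}

# List comprehension: no hashing involved, so it works
exprs = [F.sum(x).alias(x) for x in ["value", "units"]]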
01-09-2022 09:44 PM
The result should be a table with two columns: one is Year (2008, 2009), the other is Annual Income ($2500, $2000). But it didn't work unless I grouped by both Year and Income, and that gives a different result from what I want, which is grouping by Year only.
01-10-2022 12:23 AM
Hi @Shivers Robert,
Try something like this:
import pyspark.sql.functions as F

def year_sum(year, column_year, column_sum):
    return F.when(
        F.col(column_year) == year, F.col(column_sum)
    ).otherwise(F.lit(None))

display(
    df.select(*[F.sum(year_sum(i, 'year', 'your_column_variable')).alias(str(i)) for i in [2018, 2019]])
)

#### OR you can use the pivot method
display(df.groupby(F.lit('fake')).pivot('year').agg(F.sum('your_column_variable')).drop('fake'))
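As a rough illustration with made-up numbers (reusing the year_sum helper above, and assuming an income column):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data: invented yearly income rows
df = spark.createDataFrame([(2018, 1000), (2018, 1500), (2019, 2000)], ["year", "income"])

# One output column per year: 2018 -> 2500, 2019 -> 2000
display(df.select(*[F.sum(year_sum(i, "year", "income")).alias(str(i)) for i in [2018, 2019]]))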
Let me know if it works.
04-08-2024 11:07 PM
The error you're encountering, "TypeError: unhashable type: 'Column'," is likely due to the way you're defining exprs. In Python, sets use curly braces {}, but they require their items to be hashable. Since the result of sum(x).alias(x) is not hashable (it's a Column object), you cannot use a set here.
Instead, you should be using a dictionary comprehension to create a dictionary of column expressions if you want to use the .agg() method with multiple aggregations. Here's how you can modify your code:
from pyspark.sql import functions as F

col_list = ["value", "units"]
# Create a dictionary of column expressions for aggregation
exprs = {x: "sum" for x in col_list}
df2 = df1.groupBy("PRODUCT", "MARKET").agg(exprs)
However, if you want to use custom aliases for your aggregated columns, you would need to pass the aggregate functions as separate arguments to the .agg() method like this:
from pyspark.sql import functions as F

col_list = ["value", "units"]
# Create a list of column expressions for aggregation
exprs = [F.sum(x).alias(x) for x in col_list]
df2 = df1.groupBy("PRODUCT", "MARKET").agg(*exprs)
In this code snippet, exprs is a list of Column objects created by the list comprehension. The *exprs syntax unpacks the list into separate arguments for the .agg() method.
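To make the unpacking concrete, with the two columns from the question the starred call expands to roughly this (sketch):

# These two lines are equivalent when exprs = [F.sum("value").alias("value"), F.sum("units").alias("units")]
df2 = df1.groupBy("PRODUCT", "MARKET").agg(*exprs)
df2 = df1.groupBy("PRODUCT", "MARKET").agg(F.sum("value").alias("value"), F.sum("units").alias("units"))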
Note that to use F.sum and other aggregate functions, as in the second example, you need to import the functions module from pyspark.sql as F.
Remember to adjust the imports and function calls according to the data processing library you are using; the code above assumes PySpark. If you are using a different library (like pandas), the syntax for aggregation will be different.
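For comparison, a rough pandas equivalent (assuming a pandas DataFrame with the same columns) might look like:

import pandas as pd

# Hypothetical pandas DataFrame with the same columns as the PySpark example
pdf = pd.DataFrame({
    "PRODUCT": ["A", "A", "B"],
    "MARKET": ["US", "US", "EU"],
    "value": [1.0, 2.0, 3.0],
    "units": [10, 20, 30],
})

# Group by the key columns and sum the rest
pdf2 = pdf.groupby(["PRODUCT", "MARKET"], as_index=False)[["value", "units"]].sum()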