07-30-2015 08:58 PM
Data has 2 columns:
|requestDate|requestDuration|
| 2015-06-17| 104|
Here is the code:
avgSaveTimesByDate = gridSaves.groupBy(gridSaves.requestDate).agg({"requestDuration": "min", "requestDuration": "max","requestDuration": "avg"})
avgSaveTimesByDate.show(100)
Summary of Issue
I expect 4 columns of data: date, min, max and average, but only the date and average show up. The first 2 aggregations are missing. If I move max to the last position, only date and max show up. Very weird.
+-----------+--------------------+
|requestDate|AVG(requestDuration)|
+-----------+--------------------+
| 2015-06-10| 750.8886326991035|
Am I doing this incorrectly? I am trying to get a dataframe for a box plot.
Accepted Solutions
11-12-2016 12:41 PM
My guess is that this does not work because the dictionary input does not have unique keys. With this syntax, column names are the keys, and a Python dictionary keeps only the last value assigned to a repeated key. So if you request two or more aggregations for the same column, only the last one is actually passed to agg.
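To illustrate the point about non-unique keys, here is a minimal sketch in plain Python (no Spark needed). The names agg_spec and agg_pairs are made up for this example, and the list of pairs is only there to show the contrast; it is not something agg accepts directly.

```python
# The dictionary literal from the question: the same key appears three times.
agg_spec = {"requestDuration": "min", "requestDuration": "max", "requestDuration": "avg"}

# Python keeps one entry per key and the last value written wins,
# so only the "avg" request is left before agg() ever sees the dict.
print(agg_spec)
print(len(agg_spec))

# Hypothetical contrast: a list of (column, function) pairs keeps all
# three requests, because a list allows the same column to repeat.
agg_pairs = [
    ("requestDuration", "min"),
    ("requestDuration", "max"),
    ("requestDuration", "avg"),
]
print(len(agg_pairs))
```

This is consistent with what the question describes: whichever aggregation is written last in the dictionary is the only one that shows up in the result.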
08-25-2015 07:11 AM
Hi Francis,
Thanks for reaching out.
I just tried this in version 2.0 of Databricks and it appeared to work as expected.
Are you using version 2.0 and Spark 1.4?
If so I would suggest using this alternate syntax:
from pyspark.sql import functions as F
aggs = df.groupBy("cut").agg(df.cut, F.min("carat"), F.max("carat"), F.avg("carat"))
Let me know if that works for you.

