python dataframe or hiveSql update based on predecessor value?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-02-2021 05:54 AM
I have a million in rows that I need to update which looks for the highest count of the predecessor from the same source data and replaces the same value on a different row.
For example.
Original DF.
sno Object Name shape rating
1 Fruit apple round 1.0
2 Fruit apple round 2.0
3 Fruit apple square 2.5
4 Fruit orange round 1.5
```
Required Target DF.
```
sno Object Name shape rating
1 Fruit apple round 1.0
2 Fruit apple round 2.0
3 Fruit apple round 2.5 <-- automatically detect the difference in shape column and update from square to round
4 Fruit orange round 1.5
```
Pls advise, how to achieve it in databrick using either i.e pyspark or hiveSQL or scala
- Labels:
-
Hive
-
Pyspark
-
Python Dataframe
-
Scala
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-02-2021 07:25 AM
so you want to determine the max number of occurances for a group key?
That is easy: create a df: df:groupBy(Object, Name, Shape).agg(count("*"))
Then join this df with the original and replace the original shape column.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-02-2021 08:58 AM
thanks for reply, can you please elaborate how to join with original and replace the shape column?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-02-2021 11:43 PM
basically you have to create a dataframe (or use a window function, that will also work) which gives you the group combination with the most occurances. So a window/groupby on object, name, shape with a count().
Then you have to determine which shape has the max(count) for a object/name combo.
can also be done using groupby or window.
Finally you filter on this max et voila.
If you use window functions you can avoid a join I think (doin this out of my head).
data:image/s3,"s3://crabby-images/618ac/618ac5f2bf7746f4cdeea1aaad5a0ab2f9192c1d" alt=""
data:image/s3,"s3://crabby-images/618ac/618ac5f2bf7746f4cdeea1aaad5a0ab2f9192c1d" alt=""