python dataframe or hiveSql update based on predecessor value?

as999
New Contributor III

I have a million in rows that I need to update which looks for the highest count of the predecessor from the same source data and replaces the same value on a different row.  

For example.

Original DF.

sno Object Name  shape  rating

1  Fruit apple round  1.0

2  Fruit apple round  2.0

3  Fruit apple square 2.5

4  Fruit orange round  1.5

```

Required Target DF.

```

sno Object Name  shape rating

1  Fruit apple round 1.0

2  Fruit apple round 2.0

3  Fruit apple round 2.5 <-- automatically detect the difference in shape column and update from square to round

4  Fruit orange round 1.5

```

Pls advise, how to achieve it in databrick using either i.e pyspark or hiveSQL or scala

-werners-
Esteemed Contributor III

so you want to determine the max number of occurances for a group key?

That is easy: create a df: df:groupBy(Object, Name, Shape).agg(count("*"))

Then join this df with the original and replace the original shape column.

as999
New Contributor III

thanks for reply, can you please elaborate how to join with original and replace the shape column?

-werners-
Esteemed Contributor III

basically you have to create a dataframe (or use a window function, that will also work) which gives you the group combination with the most occurances. So a window/groupby on object, name, shape with a count().

Then you have to determine which shape has the max(count) for a object/name combo.

can also be done using groupby or window.

Finally you filter on this max et voila.

If you use window functions you can avoid a join I think (doin this out of my head).