python dataframe or hiveSql update based on predec...

as999 · ‎12-02-2021

I have a million in rows that I need to update which looks for the highest count of the predecessor from the same source data and replaces the same value on a different row.

For example.

Original DF.

sno Object Name shape rating

1 Fruit apple round 1.0

2 Fruit apple round 2.0

3 Fruit apple square 2.5

4 Fruit orange round 1.5

```

Required Target DF.

```

sno Object Name shape rating

1 Fruit apple round 1.0

2 Fruit apple round 2.0

3 Fruit apple round 2.5 <-- automatically detect the difference in shape column and update from square to round

4 Fruit orange round 1.5

```

Pls advise, how to achieve it in databrick using either i.e pyspark or hiveSQL or scala

-werners- · ‎12-02-2021

so you want to determine the max number of occurances for a group key?

That is easy: create a df: df:groupBy(Object, Name, Shape).agg(count("*"))

Then join this df with the original and replace the original shape column.

as999 · ‎12-02-2021

thanks for reply, can you please elaborate how to join with original and replace the shape column?

-werners- · ‎12-02-2021

basically you have to create a dataframe (or use a window function, that will also work) which gives you the group combination with the most occurances. So a window/groupby on object, name, shape with a count().

Then you have to determine which shape has the max(count) for a object/name combo.

can also be done using groupby or window.

Finally you filter on this max et voila.

If you use window functions you can avoid a join I think (doin this out of my head).

python dataframe or hiveSql update based on predecessor value?