<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: python dataframe or hiveSql update based on predecessor value? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/python-dataframe-or-hivesql-update-based-on-predecessor-value/m-p/33882#M24792</link>
    <description>&lt;P&gt;Basically you have to create a dataframe (or use a window function, that will also work) which gives you each group combination with its number of occurrences. So a window/groupBy on object, name, shape with a count().&lt;/P&gt;&lt;P&gt;Then you have to determine which shape has the max(count) for an object/name combo.&lt;/P&gt;&lt;P&gt;That can also be done using a groupBy or a window.&lt;/P&gt;&lt;P&gt;Finally you filter on this max, et voilà.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If you use window functions you can avoid a join, I think (doing this from memory).&lt;/P&gt;</description>
    <pubDate>Fri, 03 Dec 2021 07:43:12 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2021-12-03T07:43:12Z</dc:date>
    <item>
      <title>python dataframe or hiveSql update based on predecessor value?</title>
      <link>https://community.databricks.com/t5/data-engineering/python-dataframe-or-hivesql-update-based-on-predecessor-value/m-p/33879#M24789</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have a million rows to update: for each group, find the value with the highest count from the same source data and replace the differing value on the other rows.&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For example:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Original DF:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;sno Object Name&amp;nbsp;&amp;nbsp;shape&amp;nbsp;&amp;nbsp;rating&lt;/P&gt;&lt;P&gt;1&amp;nbsp;&amp;nbsp;Fruit&amp;nbsp;apple&amp;nbsp;round&amp;nbsp;&amp;nbsp;1.0&lt;/P&gt;&lt;P&gt;2&amp;nbsp;&amp;nbsp;Fruit&amp;nbsp;apple&amp;nbsp;round&amp;nbsp;&amp;nbsp;2.0&lt;/P&gt;&lt;P&gt;3&amp;nbsp;&amp;nbsp;Fruit&amp;nbsp;apple&amp;nbsp;square&amp;nbsp;2.5&lt;/P&gt;&lt;P&gt;4&amp;nbsp;&amp;nbsp;Fruit&amp;nbsp;orange round&amp;nbsp;&amp;nbsp;1.5&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Required Target DF:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;sno Object Name&amp;nbsp;&amp;nbsp;shape rating&lt;/P&gt;&lt;P&gt;1&amp;nbsp;&amp;nbsp;Fruit&amp;nbsp;apple&amp;nbsp;round 1.0&lt;/P&gt;&lt;P&gt;2&amp;nbsp;&amp;nbsp;Fruit&amp;nbsp;apple&amp;nbsp;round 2.0&lt;/P&gt;&lt;P&gt;3&amp;nbsp;&amp;nbsp;Fruit&amp;nbsp;apple&amp;nbsp;round 2.5 &amp;lt;-- automatically detect the difference in the shape column and update square to round&lt;/P&gt;&lt;P&gt;4&amp;nbsp;&amp;nbsp;Fruit&amp;nbsp;orange round 1.5&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Please advise how to achieve this in Databricks using PySpark, HiveSQL, or Scala.&lt;/P&gt;</description>
      <pubDate>Thu, 02 Dec 2021 13:54:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/python-dataframe-or-hivesql-update-based-on-predecessor-value/m-p/33879#M24789</guid>
      <dc:creator>as999</dc:creator>
      <dc:date>2021-12-02T13:54:33Z</dc:date>
    </item>
    <item>
      <title>Re: python dataframe or hiveSql update based on predecessor value?</title>
      <link>https://community.databricks.com/t5/data-engineering/python-dataframe-or-hivesql-update-based-on-predecessor-value/m-p/33880#M24790</link>
      <description>&lt;P&gt;So you want to determine the max number of occurrences for a group key?&lt;/P&gt;&lt;P&gt;That is easy. Create a df: df.groupBy("Object", "Name", "shape").agg(count("*"))&lt;/P&gt;&lt;P&gt;Then join this df with the original and replace the original shape column.&lt;/P&gt;</description>
      <pubDate>Thu, 02 Dec 2021 15:25:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/python-dataframe-or-hivesql-update-based-on-predecessor-value/m-p/33880#M24790</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-12-02T15:25:10Z</dc:date>
    </item>
    <item>
      <title>Re: python dataframe or hiveSql update based on predecessor value?</title>
      <link>https://community.databricks.com/t5/data-engineering/python-dataframe-or-hivesql-update-based-on-predecessor-value/m-p/33881#M24791</link>
      <description>&lt;P&gt;Thanks for the reply. Can you please elaborate on how to join with the original and replace the shape column?&lt;/P&gt;</description>
      <pubDate>Thu, 02 Dec 2021 16:58:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/python-dataframe-or-hivesql-update-based-on-predecessor-value/m-p/33881#M24791</guid>
      <dc:creator>as999</dc:creator>
      <dc:date>2021-12-02T16:58:55Z</dc:date>
    </item>
    <item>
      <title>Re: python dataframe or hiveSql update based on predecessor value?</title>
      <link>https://community.databricks.com/t5/data-engineering/python-dataframe-or-hivesql-update-based-on-predecessor-value/m-p/33882#M24792</link>
      <description>&lt;P&gt;Basically you have to create a dataframe (or use a window function, that will also work) which gives you each group combination with its number of occurrences. So a window/groupBy on object, name, shape with a count().&lt;/P&gt;&lt;P&gt;Then you have to determine which shape has the max(count) for an object/name combo.&lt;/P&gt;&lt;P&gt;That can also be done using a groupBy or a window.&lt;/P&gt;&lt;P&gt;Finally you filter on this max, et voilà.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If you use window functions you can avoid a join, I think (doing this from memory).&lt;/P&gt;</description>
      <pubDate>Fri, 03 Dec 2021 07:43:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/python-dataframe-or-hivesql-update-based-on-predecessor-value/m-p/33882#M24792</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-12-03T07:43:12Z</dc:date>
    </item>
  </channel>
</rss>

