Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
python dataframe or hiveSql update based on predecessor value?

as999
New Contributor III

I have about a million rows to update. For each row, I need to find the value that occurs most often among rows with the same key in the same source data, and write that value into any row that differs.

For example.

Original DF.

```
sno  Object  Name    shape   rating
1    Fruit   apple   round   1.0
2    Fruit   apple   round   2.0
3    Fruit   apple   square  2.5
4    Fruit   orange  round   1.5
```

Required Target DF.

```

sno  Object  Name    shape   rating
1    Fruit   apple   round   1.0
2    Fruit   apple   round   2.0
3    Fruit   apple   round   2.5   <-- automatically detect the difference in the shape column and update from square to round
4    Fruit   orange  round   1.5

```

Please advise how to achieve this in Databricks using PySpark, Hive SQL, or Scala.

3 REPLIES

-werners-
Esteemed Contributor III

So you want to determine which value occurs most often for a group key?

That is easy: create a df of counts: df.groupBy("Object", "Name", "shape").agg(count("*"))

Then join this df with the original and replace the original shape column.

as999
New Contributor III

Thanks for the reply. Can you please elaborate on how to join with the original and replace the shape column?

-werners-
Esteemed Contributor III

Basically you have to create a dataframe (or use a window function, that will also work) which gives you the number of occurrences of each group combination. So a window/groupby on object, name, shape with a count().

Then you have to determine which shape has the max(count) for each object/name combo.

This can also be done using groupby or window.

Finally you filter on this max, et voilà.

If you use window functions you can avoid the join, I think (writing this off the top of my head, untested).
