<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: python dataframe or hiveSql update based on predecessor value? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/python-dataframe-or-hivesql-update-based-on-predecessor-value/m-p/33882#M24792</link>
    <description>&lt;P&gt;Basically you have to create a dataframe (or use a window function, that will also work) which gives you each group combination with its number of occurrences. So a window/groupBy on object, name, shape with a count().&lt;/P&gt;&lt;P&gt;Then you have to determine which shape has the max(count) for an object/name combo.&lt;/P&gt;&lt;P&gt;That can also be done using a groupBy or a window.&lt;/P&gt;&lt;P&gt;Finally you filter on this max, et voilà.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If you use window functions you can avoid a join, I think (doing this from memory).&lt;/P&gt;</description>
    <pubDate>Fri, 03 Dec 2021 07:43:12 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2021-12-03T07:43:12Z</dc:date>
    <item>
      <title>python dataframe or hiveSql update based on predecessor value?</title>
      <link>https://community.databricks.com/t5/data-engineering/python-dataframe-or-hivesql-update-based-on-predecessor-value/m-p/33879#M24789</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have a million rows to update: for each group, find the value with the highest count from the same source data and replace the differing value on the other rows.&amp;nbsp;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For example:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Original DF:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;sno Object Name&amp;nbsp;&amp;nbsp;shape&amp;nbsp;&amp;nbsp;rating&lt;/P&gt;&lt;P&gt;1&amp;nbsp;&amp;nbsp;Fruit&amp;nbsp;apple&amp;nbsp;round&amp;nbsp;&amp;nbsp;1.0&lt;/P&gt;&lt;P&gt;2&amp;nbsp;&amp;nbsp;Fruit&amp;nbsp;apple&amp;nbsp;round&amp;nbsp;&amp;nbsp;2.0&lt;/P&gt;&lt;P&gt;3&amp;nbsp;&amp;nbsp;Fruit&amp;nbsp;apple&amp;nbsp;square&amp;nbsp;2.5&lt;/P&gt;&lt;P&gt;4&amp;nbsp;&amp;nbsp;Fruit&amp;nbsp;orange round&amp;nbsp;&amp;nbsp;1.5&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Required Target DF:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;sno Object Name&amp;nbsp;&amp;nbsp;shape rating&lt;/P&gt;&lt;P&gt;1&amp;nbsp;&amp;nbsp;Fruit&amp;nbsp;apple&amp;nbsp;round 1.0&lt;/P&gt;&lt;P&gt;2&amp;nbsp;&amp;nbsp;Fruit&amp;nbsp;apple&amp;nbsp;round 2.0&lt;/P&gt;&lt;P&gt;3&amp;nbsp;&amp;nbsp;Fruit&amp;nbsp;apple&amp;nbsp;round 2.5 &amp;lt;-- automatically detect the difference in the shape column and update square to round&lt;/P&gt;&lt;P&gt;4&amp;nbsp;&amp;nbsp;Fruit&amp;nbsp;orange round 1.5&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Please advise how to achieve this in Databricks using PySpark, HiveSQL, or Scala.&lt;/P&gt;</description>
      <pubDate>Thu, 02 Dec 2021 13:54:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/python-dataframe-or-hivesql-update-based-on-predecessor-value/m-p/33879#M24789</guid>
      <dc:creator>as999</dc:creator>
      <dc:date>2021-12-02T13:54:33Z</dc:date>
    </item>
    <item>
      <title>Re: python dataframe or hiveSql update based on predecessor value?</title>
      <link>https://community.databricks.com/t5/data-engineering/python-dataframe-or-hivesql-update-based-on-predecessor-value/m-p/33880#M24790</link>
      <description>&lt;P&gt;So you want to determine the max number of occurrences for a group key?&lt;/P&gt;&lt;P&gt;That is easy. Create a df: df.groupBy("Object", "Name", "shape").agg(count("*"))&lt;/P&gt;&lt;P&gt;Then join this df with the original and replace the original shape column.&lt;/P&gt;</description>
      <pubDate>Thu, 02 Dec 2021 15:25:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/python-dataframe-or-hivesql-update-based-on-predecessor-value/m-p/33880#M24790</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-12-02T15:25:10Z</dc:date>
    </item>
    <item>
      <title>Re: python dataframe or hiveSql update based on predecessor value?</title>
      <link>https://community.databricks.com/t5/data-engineering/python-dataframe-or-hivesql-update-based-on-predecessor-value/m-p/33881#M24791</link>
      <description>&lt;P&gt;Thanks for the reply. Can you please elaborate on how to join with the original and replace the shape column?&lt;/P&gt;</description>
      <pubDate>Thu, 02 Dec 2021 16:58:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/python-dataframe-or-hivesql-update-based-on-predecessor-value/m-p/33881#M24791</guid>
      <dc:creator>as999</dc:creator>
      <dc:date>2021-12-02T16:58:55Z</dc:date>
    </item>
    <item>
      <title>Re: python dataframe or hiveSql update based on predecessor value?</title>
      <link>https://community.databricks.com/t5/data-engineering/python-dataframe-or-hivesql-update-based-on-predecessor-value/m-p/33882#M24792</link>
      <description>&lt;P&gt;Basically you have to create a dataframe (or use a window function, that will also work) which gives you each group combination with its number of occurrences. So a window/groupBy on object, name, shape with a count().&lt;/P&gt;&lt;P&gt;Then you have to determine which shape has the max(count) for an object/name combo.&lt;/P&gt;&lt;P&gt;That can also be done using a groupBy or a window.&lt;/P&gt;&lt;P&gt;Finally you filter on this max, et voilà.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If you use window functions you can avoid a join, I think (doing this from memory).&lt;/P&gt;</description>
      <pubDate>Fri, 03 Dec 2021 07:43:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/python-dataframe-or-hivesql-update-based-on-predecessor-value/m-p/33882#M24792</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-12-03T07:43:12Z</dc:date>
    </item>
  </channel>
</rss>

