<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Pyspark dataframe column comparison in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-column-comparison/m-p/19245#M12881</link>
    <description>&lt;P&gt;Hi @Nhat Hoang&lt;/P&gt;&lt;P&gt;Thanks for the answer.&lt;/P&gt;&lt;P&gt;Cheers.&lt;/P&gt;</description>
    <pubDate>Sat, 03 Dec 2022 07:59:46 GMT</pubDate>
    <dc:creator>UmaMahesh1</dc:creator>
    <dc:date>2022-12-03T07:59:46Z</dc:date>
    <item>
      <title>Pyspark dataframe column comparison</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-column-comparison/m-p/19243#M12879</link>
      <description>&lt;P&gt;I have a string column whose values are elements concatenated with hyphens. Three example values from that column:&lt;/P&gt;&lt;P&gt;Row 1 - A-B-C-D-E-F&lt;/P&gt;&lt;P&gt;Row 2 - A-B-G-C-D-E-F&lt;/P&gt;&lt;P&gt;Row 3 - A-B-G-D-E-F&lt;/P&gt;&lt;P&gt;I want to compare each pair of consecutive rows and create a column describing what has changed. Specifically, there are four comparisons:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;whether the first element changed&lt;/LI&gt;&lt;LI&gt;whether the last element changed&lt;/LI&gt;&lt;LI&gt;elements added, considering all elements except the first and last&lt;/LI&gt;&lt;LI&gt;elements removed, considering all elements except the first and last&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;So my output would look like this:&lt;/P&gt;&lt;P&gt;Row 1 : null&lt;/P&gt;&lt;P&gt;Row 2 : G added&lt;/P&gt;&lt;P&gt;Row 3 : C removed&lt;/P&gt;&lt;P&gt;Any ideas or suggestions?&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 19:26:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-column-comparison/m-p/19243#M12879</guid>
      <dc:creator>UmaMahesh1</dc:creator>
      <dc:date>2022-12-01T19:26:31Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark dataframe column comparison</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-column-comparison/m-p/19244#M12880</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I think you can follow these steps:&lt;/P&gt;&lt;P&gt;&lt;B&gt;1.&lt;/B&gt; Use a window function (lag) to create a new column holding the previous row's value. Your df will then look like this:&lt;/P&gt;&lt;P&gt;&lt;B&gt;id    value                        lag&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;1      A-B-C-D-E-F         null&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;2     A-B-G-C-D-E-F    A-B-C-D-E-F&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;3     A-B-G-D-E-F         A-B-G-C-D-E-F&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;2.&lt;/B&gt; Use split() to convert both string columns to arrays:&lt;/P&gt;&lt;P&gt;&lt;B&gt;id    value                               lag&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;1      ['A', 'B', 'C', 'D', 'E', 'F']         null&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;2     ['A', 'B', 'G', 'C', 'D', 'E', 'F']   ['A', 'B', 'C', 'D', 'E', 'F']&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;I&gt;3     ['A', 'B', 'G', 'D', 'E', 'F']        ['A', 'B', 'G', 'C', 'D', 'E', 'F']&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;3.&lt;/B&gt; Create a column using &lt;B&gt;array_except('value', 'lag')&lt;/B&gt; to find the elements that are in column 'value' but not in column 'lag' (the added elements).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;4.&lt;/B&gt; Create a column using &lt;B&gt;array_except('lag', 'value')&lt;/B&gt; to find the elements that are in column 'lag' but not in column 'value' (the removed elements).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;5.&lt;/B&gt; Then &lt;B&gt;concat&lt;/B&gt; the two columns above to get the comparison.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Hope it works.&lt;/P&gt;</description>
      <pubDate>Sat, 03 Dec 2022 04:03:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-column-comparison/m-p/19244#M12880</guid>
      <dc:creator>NhatHoang</dc:creator>
      <dc:date>2022-12-03T04:03:13Z</dc:date>
    </item>
    <item>
      <title>Re: Pyspark dataframe column comparison</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-column-comparison/m-p/19245#M12881</link>
      <description>&lt;P&gt;Hi @Nhat Hoang&lt;/P&gt;&lt;P&gt;Thanks for the answer.&lt;/P&gt;&lt;P&gt;Cheers.&lt;/P&gt;</description>
      <pubDate>Sat, 03 Dec 2022 07:59:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-dataframe-column-comparison/m-p/19245#M12881</guid>
      <dc:creator>UmaMahesh1</dc:creator>
      <dc:date>2022-12-03T07:59:46Z</dc:date>
    </item>
  </channel>
</rss>

