topic Re: Pyspark dataframe column comparison in Data Engineering

Pyspark dataframe column comparison

UmaMahesh1 — Thu, 01 Dec 2022 19:26:31 GMT

I have a string column which is a concatenation of elements with a hyphen as follows. Let 3 values from that column looks like below,

Row 1 - A-B-C-D-E-F

Row 2 - A-B-G-C-D-E-F

Row 3 - A-B-G-D-E-F

I want to compare 2 consecutive rows and create a column with what has changed. Specifically, 4 comparisons

So my output will look like this

Row1 ; null

Row2 : G added

Row3 : C Removed

any ideas/suggestions ?

NhatHoang — Sat, 03 Dec 2022 04:03:13 GMT

Hi,

I think you can follow these steps:

1. Use window function to create a new column by shifting, then your df will look like this

id value lag

1 A-B-C-D-E-F null

2 A-B-G-C-D-E-F A-B-C-D-E-F

3 A-B-G-D-E-F A-B-G-C-D-E-F

2. Use split() to convert string to array.

id value lag

1 ['A', 'B', 'C', 'D', 'E', 'F'] null

2 ['A', 'B', 'G', 'C', 'D', 'E', 'F'] ['A', 'B', 'C', 'D', 'E', 'F']

3 ['A', 'B', 'G', 'D', 'E', 'F'] ['A', 'B', 'G', 'C', 'D', 'E', 'F']

3. Create a column using array_except('value', 'lag') to find element in column 'value' but not in column 'lag'

4. Create a column using array_except('lag', 'value') to find element in column 'lag' but not in column 'value'

5. Then, you concat these two columns above, you will have the comparison.

Hope it works.

UmaMahesh1 — Sat, 03 Dec 2022 07:59:46 GMT

Hi @Nhat Hoang

Thanks for the answer.

Cheers..