- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-01-2022 11:26 AM
I have a string column which is a concatenation of elements with a hyphen as follows. Let 3 values from that column looks like below,
Row 1 - A-B-C-D-E-F
Row 2 - A-B-G-C-D-E-F
Row 3 - A-B-G-D-E-F
I want to compare 2 consecutive rows and create a column with what has changed. Specifically, 4 comparisons
- if first element changed
- last element changed
- elements Added when taking all except first and last
- elements removed when taking all except first and last
So my output will look like this
Row1 ; null
Row2 : G added
Row3 : C Removed
any ideas/suggestions ?
- Labels:
-
Pyspark
-
String Column
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-02-2022 08:03 PM
Hi,
I think you can follow these steps:
1. Use window function to create a new column by shifting, then your df will look like this
id value lag
1 A-B-C-D-E-F null
2 A-B-G-C-D-E-F A-B-C-D-E-F
3 A-B-G-D-E-F A-B-G-C-D-E-F
2. Use split() to convert string to array.
id value lag
1 ['A', 'B', 'C', 'D', 'E', 'F'] null
2 ['A', 'B', 'G', 'C', 'D', 'E', 'F'] ['A', 'B', 'C', 'D', 'E', 'F']
3 ['A', 'B', 'G', 'D', 'E', 'F'] ['A', 'B', 'G', 'C', 'D', 'E', 'F']
3. Create a column using array_except('value', 'lag') to find element in column 'value' but not in column 'lag'
4. Create a column using array_except('lag', 'value') to find element in column 'lag' but not in column 'value'
5. Then, you concat these two columns above, you will have the comparison.
Hope it works.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-02-2022 08:03 PM
Hi,
I think you can follow these steps:
1. Use window function to create a new column by shifting, then your df will look like this
id value lag
1 A-B-C-D-E-F null
2 A-B-G-C-D-E-F A-B-C-D-E-F
3 A-B-G-D-E-F A-B-G-C-D-E-F
2. Use split() to convert string to array.
id value lag
1 ['A', 'B', 'C', 'D', 'E', 'F'] null
2 ['A', 'B', 'G', 'C', 'D', 'E', 'F'] ['A', 'B', 'C', 'D', 'E', 'F']
3 ['A', 'B', 'G', 'D', 'E', 'F'] ['A', 'B', 'G', 'C', 'D', 'E', 'F']
3. Create a column using array_except('value', 'lag') to find element in column 'value' but not in column 'lag'
4. Create a column using array_except('lag', 'value') to find element in column 'lag' but not in column 'value'
5. Then, you concat these two columns above, you will have the comparison.
Hope it works.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
12-02-2022 11:59 PM
Hi @Nhat Hoang
Thanks for the answer.
Cheers..

