cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Pyspark dataframe column comparison

UmaMahesh1
Honored Contributor III

I have a string column which is a concatenation of elements with a hyphen as follows. Let 3 values from that column looks like below,

Row 1 - A-B-C-D-E-F

Row 2 - A-B-G-C-D-E-F

Row 3 - A-B-G-D-E-F

I want to compare 2 consecutive rows and create a column with what has changed. Specifically, 4 comparisons

  1. if first element changed
  2. last element changed
  3. elements Added when taking all except first and last
  4. elements removed when taking all except first and last

So my output will look like this

Row1 ; null

Row2 : G added

Row3 : C Removed

any ideas/suggestions ?

1 ACCEPTED SOLUTION

Accepted Solutions

NhatHoang
Valued Contributor II

Hi,

I think you can follow these steps:

1. Use window function to create a new column by shifting, then your df will look like this

id value lag

1 A-B-C-D-E-F null

2 A-B-G-C-D-E-F A-B-C-D-E-F

3 A-B-G-D-E-F A-B-G-C-D-E-F

2. Use split() to convert string to array.

id value lag

1 ['A', 'B', 'C', 'D', 'E', 'F'] null

2 ['A', 'B', 'G', 'C', 'D', 'E', 'F'] ['A', 'B', 'C', 'D', 'E', 'F']

3 ['A', 'B', 'G', 'D', 'E', 'F'] ['A', 'B', 'G', 'C', 'D', 'E', 'F']

3. Create a column using array_except('value', 'lag') to find element in column 'value' but not in column 'lag'

4. Create a column using array_except('lag', 'value') to find element in column 'lag' but not in column 'value'

5. Then, you concat these two columns above, you will have the comparison.

Hope it works.

View solution in original post

2 REPLIES 2

NhatHoang
Valued Contributor II

Hi,

I think you can follow these steps:

1. Use window function to create a new column by shifting, then your df will look like this

id value lag

1 A-B-C-D-E-F null

2 A-B-G-C-D-E-F A-B-C-D-E-F

3 A-B-G-D-E-F A-B-G-C-D-E-F

2. Use split() to convert string to array.

id value lag

1 ['A', 'B', 'C', 'D', 'E', 'F'] null

2 ['A', 'B', 'G', 'C', 'D', 'E', 'F'] ['A', 'B', 'C', 'D', 'E', 'F']

3 ['A', 'B', 'G', 'D', 'E', 'F'] ['A', 'B', 'G', 'C', 'D', 'E', 'F']

3. Create a column using array_except('value', 'lag') to find element in column 'value' but not in column 'lag'

4. Create a column using array_except('lag', 'value') to find element in column 'lag' but not in column 'value'

5. Then, you concat these two columns above, you will have the comparison.

Hope it works.

UmaMahesh1
Honored Contributor III

Hi @Nhat Hoang​ 

Thanks for the answer.

Cheers..

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.