Databricks Community

UmaMahesh1 · ‎12-01-2022

I have a string column which is a concatenation of elements with a hyphen as follows. Let 3 values from that column looks like below,

Row 1 - A-B-C-D-E-F

Row 2 - A-B-G-C-D-E-F

Row 3 - A-B-G-D-E-F

I want to compare 2 consecutive rows and create a column with what has changed. Specifically, 4 comparisons

if first element changed
last element changed
elements Added when taking all except first and last
elements removed when taking all except first and last

So my output will look like this

Row1 ; null

Row2 : G added

Row3 : C Removed

any ideas/suggestions ?

Uma Mahesh D

NhatHoang · ‎12-02-2022

Hi,

I think you can follow these steps:

1. Use window function to create a new column by shifting, then your df will look like this

id value lag

1 A-B-C-D-E-F null

2 A-B-G-C-D-E-F A-B-C-D-E-F

3 A-B-G-D-E-F A-B-G-C-D-E-F

2. Use split() to convert string to array.

id value lag

1 ['A', 'B', 'C', 'D', 'E', 'F'] null

2 ['A', 'B', 'G', 'C', 'D', 'E', 'F'] ['A', 'B', 'C', 'D', 'E', 'F']

3 ['A', 'B', 'G', 'D', 'E', 'F'] ['A', 'B', 'G', 'C', 'D', 'E', 'F']

3. Create a column using array_except('value', 'lag') to find element in column 'value' but not in column 'lag'

4. Create a column using array_except('lag', 'value') to find element in column 'lag' but not in column 'value'

5. Then, you concat these two columns above, you will have the comparison.

Hope it works.

View solution in original post

NhatHoang · ‎12-02-2022

Hi,

I think you can follow these steps:

1. Use window function to create a new column by shifting, then your df will look like this

id value lag

1 A-B-C-D-E-F null

2 A-B-G-C-D-E-F A-B-C-D-E-F

3 A-B-G-D-E-F A-B-G-C-D-E-F

2. Use split() to convert string to array.

id value lag

1 ['A', 'B', 'C', 'D', 'E', 'F'] null

2 ['A', 'B', 'G', 'C', 'D', 'E', 'F'] ['A', 'B', 'C', 'D', 'E', 'F']

3 ['A', 'B', 'G', 'D', 'E', 'F'] ['A', 'B', 'G', 'C', 'D', 'E', 'F']

3. Create a column using array_except('value', 'lag') to find element in column 'value' but not in column 'lag'

4. Create a column using array_except('lag', 'value') to find element in column 'lag' but not in column 'value'

5. Then, you concat these two columns above, you will have the comparison.

Hope it works.

UmaMahesh1 · ‎12-02-2022

Hi @Nhat Hoang

Thanks for the answer.

Cheers..

Uma Mahesh D

Databricks Community

Pyspark dataframe column comparison

Photos

Join Us as a Local Community Builder!

Virtual Learning Festival: 9 April - 30 April

Intelligent Data Warehousing: AI/BI for Self-service Analytics

Get Started With Lakehouse Architecture | Pass a quiz to earn your certificate completion.

Data + AI Summit 2025 — registration now open!

Databricks Community Champion - March 2025 - Takuya Omi