Databricks Community

UmaMahesh1 · ‎12-01-2022

I have a string column whose values are elements concatenated with hyphens. For example, three values from that column look like this:

Row 1 - A-B-C-D-E-F

Row 2 - A-B-G-C-D-E-F

Row 3 - A-B-G-D-E-F

I want to compare each pair of consecutive rows and create a column describing what has changed. Specifically, there are 4 comparisons:

whether the first element changed
whether the last element changed
elements added, considering all elements except the first and last
elements removed, considering all elements except the first and last

So my output would look like this:

Row 1 : null

Row 2 : G added

Row 3 : C removed

Any ideas or suggestions?
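To pin down the four comparisons, here is a minimal plain-Python sketch of what is being asked for on a single pair of rows. The function name `describe_change` and the dictionary keys are hypothetical, chosen only for illustration, and the sketch assumes each string has at least one element.

```python
from typing import Optional


def describe_change(prev: Optional[str], curr: str, sep: str = "-") -> Optional[dict]:
    """Compare the current row's string with the previous row's string."""
    if prev is None:
        # The first row has nothing to compare against.
        return None
    p, c = prev.split(sep), curr.split(sep)
    # "Middle" elements: everything except the first and the last.
    p_mid, c_mid = p[1:-1], c[1:-1]
    return {
        "first_changed": p[0] != c[0],
        "last_changed": p[-1] != c[-1],
        "added": [x for x in c_mid if x not in p_mid],
        "removed": [x for x in p_mid if x not in c_mid],
    }


rows = ["A-B-C-D-E-F", "A-B-G-C-D-E-F", "A-B-G-D-E-F"]
previous = [None] + rows[:-1]
for prev, curr in zip(previous, rows):
    print(curr, describe_change(prev, curr))
```

For the sample rows, this reports G as added on Row 2 and C as removed on Row 3, with the first and last elements unchanged.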

Uma Mahesh D

NhatHoang · ‎12-02-2022

Hi,

I think you can follow these steps:

1. Use a window function (lag) to create a new column that holds the previous row's value. Your df will then look like this:

id | value | lag
1 | A-B-C-D-E-F | null
2 | A-B-G-C-D-E-F | A-B-C-D-E-F
3 | A-B-G-D-E-F | A-B-G-C-D-E-F

2. Use split() to convert each string to an array:

id | value | lag
1 | ['A', 'B', 'C', 'D', 'E', 'F'] | null
2 | ['A', 'B', 'G', 'C', 'D', 'E', 'F'] | ['A', 'B', 'C', 'D', 'E', 'F']
3 | ['A', 'B', 'G', 'D', 'E', 'F'] | ['A', 'B', 'G', 'C', 'D', 'E', 'F']

3. Create a column using array_except('value', 'lag') to find the elements that are in column 'value' but not in column 'lag' (the added elements).

4. Create a column using array_except('lag', 'value') to find the elements that are in column 'lag' but not in column 'value' (the removed elements).

5. Then concatenate the two columns above to get the comparison.

Hope it works.

UmaMahesh1 · ‎12-02-2022

Hi @Nhat Hoang

Thanks for the answer.

Cheers..

Uma Mahesh D

Pyspark dataframe column comparison
