Traversing to previous rows and getting the data based on condition

RajNath
New Contributor II

Sample Input data set

ClusterId        | Event       | EventTime
1212-18-r9u1kzn1 | RUNNING     | 2024-02-02T11:38:30.168+00:00
1212-18-r9u1kzn1 | TERMINATING | 2024-02-02T13:43:33.933+00:00
1212-18-r9u1kzn1 | STARTING    | 2024-02-02T15:50:05.174+00:00
1212-18-r9u1kzn1 | RUNNING     | 2024-02-02T15:54:21.510+00:00
1212-18-r9u1kzn1 | RUNNING     | 2024-02-02T16:09:20.576+00:00
1212-18-r9u1kzn1 | RUNNING     | 2024-02-02T16:19:58.744+00:00
1212-18-r9u1kzn1 | TERMINATING | 2024-02-02T17:18:33.863+00:00
1212-18-r9u1kzn1 | STARTING    | 2024-02-02T17:22:38.635+00:00
1212-18-r9u1kzn1 | RUNNING     | 2024-02-02T17:23:40.781+00:00
1212-18-r9u1kzn1 | TERMINATING | 2024-02-02T18:03:33.953+00:00
1212-18-r9u1kzn1 | STARTING    | 2024-02-02T21:10:21.651+00:00
1212-18-r9u1kzn1 | RUNNING     | 2024-02-02T21:13:59.842+00:00
1212-18-r9u1kzn1 | TERMINATING | 2024-02-02T22:43:34.022+00:00

Below is the sample expected output. For each "TERMINATING" event, the RunningEventTime column should show the event time of the most recent preceding RUNNING event. If a "STARTING" event precedes the "TERMINATING" event, its event time should instead appear in the "StartingEventTime" column.

ClusterId        | Event       | EventTime                     | RunningEventTime              | StartingEventTime
1212-18-r9u1kzn1 | RUNNING     | 2024-02-02T11:38:30.168+00:00 |                               |
1212-18-r9u1kzn1 | TERMINATING | 2024-02-02T13:43:33.933+00:00 | 2024-02-02T11:38:30.168+00:00 |
1212-18-r9u1kzn1 | STARTING    | 2024-02-02T15:50:05.174+00:00 |                               |
1212-18-r9u1kzn1 | RUNNING     | 2024-02-02T15:54:21.510+00:00 |                               |
1212-18-r9u1kzn1 | RUNNING     | 2024-02-02T16:09:20.576+00:00 |                               |
1212-18-r9u1kzn1 | RUNNING     | 2024-02-02T16:19:58.744+00:00 |                               |
1212-18-r9u1kzn1 | TERMINATING | 2024-02-02T17:18:33.863+00:00 |                               | 2024-02-02T15:50:05.174+00:00
1212-18-r9u1kzn1 | STARTING    | 2024-02-02T17:22:38.635+00:00 |                               |
1212-18-r9u1kzn1 | RUNNING     | 2024-02-02T17:23:40.781+00:00 |                               |
1212-18-r9u1kzn1 | TERMINATING | 2024-02-02T18:03:33.953+00:00 |                               | 2024-02-02T17:22:38.635+00:00
1212-18-r9u1kzn1 | STARTING    | 2024-02-02T21:10:21.651+00:00 |                               |
1212-18-r9u1kzn1 | RUNNING     | 2024-02-02T21:13:59.842+00:00 |                               |
1212-18-r9u1kzn1 | TERMINATING | 2024-02-02T22:43:34.022+00:00 |                               | 2024-02-02T21:10:21.651+00:00

I tried a few options, such as a self join, but that is not ideal when the data set is large. Another option I tried is looping, but it has the same problem and will not scale to large data sets. I also tried the window function "lag" but could not make it work. Any suggestion or hint would be really helpful.
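
To illustrate where "lag" falls short here, a minimal sketch (column names as in the sample above): a fixed-offset lag only reaches a set number of rows back, so by itself it does not return the last RUNNING or STARTING event time when the number of rows in between varies.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.partitionBy("ClusterId").orderBy("EventTime")
    # lag(1) returns the EventTime of the immediately preceding row only,
    # which is not necessarily the last RUNNING or STARTING event
    prev_event_time = F.lag("EventTime", 1).over(w)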

2 REPLIES

Kaniz
Community Manager

Hi @RajNath, handling event times and aggregations in large datasets can be challenging, but Structured Streaming in Databricks provides powerful tools to address this.

Let’s break down your requirements and explore how you can achieve them:

  1. RunningEventTime for TERMINATING Events:

    • You want to find, for each “TERMINATING” event, the event time of the most recent preceding RUNNING event.

    • To achieve this, you can use the window function to create time-based windows over the event time column. Specifically, you’ll create tumbling windows of a certain duration (e.g., 5 minutes) and aggregate within those windows.

    • Here’s an example of how you can calculate the average signal strength for each device over 5-minute windows:

      from pyspark.sql.functions import window, avg
      
      # Assuming you have a streaming DataFrame 'eventsDF' with schema [eventTime: timestamp, deviceId: string, signal: bigint]
      windowedAvgSignalDF = eventsDF.groupBy(window("eventTime", "5 minutes"), "deviceId").agg(avg("signal"))
      
    • In your case, instead of calculating the average signal strength, you’ll need to adjust the aggregation so that it captures the event time of the most recent RUNNING (or STARTING) event for each “TERMINATING” event (see the sketch after this list).

  2. Handling “STARTING” Events:

    • If a “STARTING” event is present, you want to show its event time in the “StartingEventTime” column.
    • You can achieve this by using conditional expressions (such as when and otherwise) to compute the value for the “StartingEventTime” column based on the presence of “STARTING” events.
  3. Watermarking and Late Data Handling:

    • Watermarking is crucial for handling late data in event-time processing. It allows you to specify a threshold beyond which late events are dropped.
    • Set an appropriate watermark duration based on your data characteristics. For example, if your data tends to arrive late by up to 10 minutes, set the watermark to 10 minutes (a one-line example appears at the end of this reply).
    • Late events (those arriving after the watermark) won’t be considered for aggregation, ensuring correctness.
  4. Avoiding Self-Joins and Loops:

    • As you mentioned, self-joins and loops can be inefficient for large datasets.
    • Structured Streaming’s windowing functions provide a more efficient way to handle event-time aggregations without explicit self-joins.
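
As an illustration of points 1, 2, and 4, here is a minimal sketch. It does not use the tumbling windows from point 1; instead it applies analytic window functions (last with ignorenulls=True over an ordered, per-cluster window) to a static DataFrame, which avoids both self-joins and loops. It assumes the rule implied by the expected output: for each “TERMINATING” row, show the most recent preceding STARTING time in StartingEventTime when one exists, otherwise the most recent preceding RUNNING time in RunningEventTime. The DataFrame name eventsDF and the column names follow the sample data.

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # All rows before the current one, for the same cluster, ordered by event time
    w = (Window.partitionBy("ClusterId")
               .orderBy("EventTime")
               .rowsBetween(Window.unboundedPreceding, -1))

    # Most recent STARTING / RUNNING event times seen so far (null if none yet)
    prev_starting = F.last(F.when(F.col("Event") == "STARTING", F.col("EventTime")),
                           ignorenulls=True).over(w)
    prev_running = F.last(F.when(F.col("Event") == "RUNNING", F.col("EventTime")),
                          ignorenulls=True).over(w)

    result = (eventsDF
        # StartingEventTime: filled only for TERMINATING rows preceded by a STARTING event
        .withColumn("StartingEventTime",
                    F.when(F.col("Event") == "TERMINATING", prev_starting))
        # RunningEventTime: filled only for TERMINATING rows with no preceding STARTING event
        .withColumn("RunningEventTime",
                    F.when((F.col("Event") == "TERMINATING") & prev_starting.isNull(),
                           prev_running)))

Because this is a single pass over an ordered window, it does not depend on events arriving at fixed intervals; note that this style of analytic window function applies to batch (static) DataFrames rather than streaming queries.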

If you encounter any issues or need further assistance, feel free to ask! 😊

For more detailed information, you can refer to the Databricks blog post on event-time aggregation and watermarking.
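
As a small illustration of point 3, watermarking in a streaming query is declared with withWatermark; a minimal sketch, assuming the cluster events arrive through a hypothetical streaming table named cluster_events:

    # Streaming only: events older than the watermark may be dropped by stateful aggregations
    events = spark.readStream.table("cluster_events")  # hypothetical source table
    late_tolerant_events = events.withWatermark("EventTime", "10 minutes")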

 

RajNath
New Contributor II

This event data does not have a specific pattern, so I cannot group it based on a fixed interval. The only options I see are a self join or looping, but I want to avoid those. Is there any other option for this data set?
