OK so I'm trying to determine if a timestamp column has a regular interval or not, i.e. the difference between each consecutive value is the same across the entire column.
I tried something like this
val timeColumn: String =
val groupByColumn: String =
val window = Window.orderBy(col(timeColumn).asc).partitionBy(groupByColumn)
val intervalsDF = inputDF.select(col(timeColumn) - lag(col(timeColumn), 1).over(window))
val intervals = intervalsDF.distinct().collect().filterNot(_.isNullAt(0))
val isRegular = intervals.length == 1
This works only if the timestamps are in the hours, minutes, or seconds. But won't work for instance, if I have monthly timestamps like this
|time|c1|c2|c3| ...|
|-|-|-|-|-|
|2020-01-01 00:00:00|||| ...|
|2020-02-01 00:00:00|||| ...|
|2020-03-01 00:00:00|||| ...|
|2020-04-01 00:00:00|||| ...|
The logic above won't work because the number of days in January is different from the one in February, so it will think the timestamps are not of the same interval.
What other approach should I try to check if the timestamps follow a regular interval or not?