We are having issues with checkpoints and schema versions getting out of date (no idea why), but it causes jobs to fail. We have jobs that are running 15-30 streaming queries, so if one fails, that creates an issue. I would like to trap the checkpoint errors, and just reset the checkpoint and log a failure. Not optimal because then I'm reprocessing the stream or at least the window we are looking at.
So the problem I have is it seems the only way to error trap the stream is to use an awaitTermination()... this locks up notebook and the next streams won't start until the first stream is terminated. The awaitAnyTermination() won't catch when the job starts up and an error occurs because the job stops before hitting the awaitAnyTermination?