Hi,
I'm currently facing an issue with task retries on a Structured Streaming job configured with unlimited retries. Our job crashes frequently due to out-of-memory problems, so as a workaround we set the task retry limit to -1 (unlimited), which is also the practice the Databricks documentation suggests for production streaming jobs.
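For reference, this is roughly how the retry setting looks in our job definition, here as a minimal Jobs API 2.1 sketch (the job name, notebook path, workspace URL, and token are placeholders, not our real values):

import requests

# Sketch of the job definition; max_retries = -1 tells Databricks Jobs
# to retry the task indefinitely after a failure.
payload = {
    "name": "solace-streaming-job",  # placeholder name
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Jobs/solace_ingest"},  # placeholder path
            "max_retries": -1,                 # unlimited retries
            "min_retry_interval_millis": 60000,
            "retry_on_timeout": True,
        }
    ],
}

resp = requests.post(
    "https://<workspace-url>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=payload,
)
resp.raise_for_status()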
However, after a couple of crashes the cluster quite often fails to find its dependencies. We use a custom data source library, which loads fine on the first runs. The bug does not happen every time, so it is not easy to reproduce.
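For context, the stream is defined along these lines (a PySpark sketch; the connection options are placeholders rather than our real configuration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark resolves the "solacemqtt" short name through its DataSourceRegister
# service loader, so the connector jar must be on the cluster's classpath
# when the stream (re)starts.
df = (
    spark.readStream
    .format("solacemqtt")
    .option("host", "tcp://<broker-host>:1883")  # placeholder broker address
    .option("topic", "<topic-name>")             # placeholder topic
    .load()
)

The library is installed on the cluster and the lookup succeeds on the first runs, so it seems to be lost only after some of the automatic restarts.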
This is the error I see:
: java.lang.ClassNotFoundException:
Failed to find data source: solacemqtt. Please find packages at
https://spark.apache.org/third-party-projects.html