Our pipelines sometimes get stuck (example).
Some workers get decommissioned due to spot termination, and then new workers are added.
However, after the replacement, Spark doesn't notice the new executors.
I don't know why, and I'm not sure how to debug this, but here are some of my observations:
* The init script logs of the workers that Spark doesn't notice look fine; the scripts complete successfully.
* The driver logs don't show anything significant after the old executors get decommissioned; the driver simply never registers the new executors.
How do I debug this, and what could the issue be?
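For what it's worth, one thing I've been thinking of trying is polling the driver's monitoring REST API (`/api/v1/applications/<app-id>/allexecutors`) to see whether the new executors ever register with the driver at all. Here's a minimal sketch of how I'd interpret that payload; the sample JSON below is hypothetical, but the field names (`id`, `isActive`, `totalCores`) follow Spark's documented `ExecutorSummary` schema:

```python
import json

# Hypothetical sample of what GET http://<driver>:4040/api/v1/applications/<app-id>/allexecutors
# might return after a spot termination: executor 1 was decommissioned,
# executor 2 is the replacement (if it registered at all).
SAMPLE = json.loads("""
[
  {"id": "driver", "isActive": true,  "totalCores": 0},
  {"id": "1",      "isActive": false, "totalCores": 4},
  {"id": "2",      "isActive": true,  "totalCores": 4}
]
""")

def active_executors(executors):
    """Return ids of active, non-driver executors from an /allexecutors payload."""
    return [e["id"] for e in executors if e["isActive"] and e["id"] != "driver"]

print(active_executors(SAMPLE))
```

If the new workers' executor ids never show up here (or show up with `isActive: false`), that would suggest they never complete registration with the driver, rather than registering and then being ignored.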
Sergey