@minhhung0507 wrote:
We're facing a persistent issue with our production streaming pipelines where executors are being forcefully killed with the following error:
Executor got terminated abnormally due to FORCE_KILL
I solved the issue in our case and I think I now understand why it happened in the first place. In our DLT workload the amount of state information is unusually high (at least I think so) compared to the total volume of data we process. I learned in the meantime that RocksDB (where the state information is stored) operates outside the JVM, meaning it uses the worker's non-heap (native) memory. If that non-heap memory consumption climbs too high, the worker process simply gets killed by the OS.
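For context, in plain Structured Streaming (outside DLT, where this is managed for you) the RocksDB state store is something you opt into explicitly. A minimal sketch, using the provider class that ships with Apache Spark:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stateful-stream").getOrCreate()

# Use RocksDB (native, off-heap) instead of the default in-JVM-memory
# state store for stateful streaming operators.
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
)
```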
I changed several settings in spark.conf, so I can't tell you exactly which one solved our issue, or whether it was the combination of all of them, but here is what I changed:
1. Allocated more non-heap memory to the workers (see the Spark documentation); a rough sketch follows this list.
2. Limited RocksDB's memory usage and tuned a few other RocksDB-related settings (sketched further below).
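For (1), the relevant knob in open-source Spark is `spark.executor.memoryOverhead`, which reserves container memory beyond the JVM heap for native allocations such as RocksDB. A rough sketch only: the values are placeholders, and on Databricks/DLT these belong in the cluster or pipeline Spark config rather than application code, since they must be set before executors launch.

```python
from pyspark.sql import SparkSession

# Submit-time settings (spark-submit --conf, spark-defaults.conf, or the
# cluster's Spark config). Values here are illustrative only.
spark = (
    SparkSession.builder
    .appName("stateful-stream")
    # JVM heap per executor.
    .config("spark.executor.memory", "8g")
    # Extra container memory for non-heap/native use; RocksDB lives here.
    .config("spark.executor.memoryOverhead", "4g")
    .getOrCreate()
)
```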
The settings I'm talking about in (2) can be found here:
rocksdb-101-optimizing-stateful-streaming-in-apache-spark-with-amazon-emr-and-aws-glue
I found this resource extremely helpful for the explanations it provides as well as the suggested "defaults".
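For reference, the kind of RocksDB-related settings that article discusses correspond to Spark SQL confs along these lines. The key names come from the Apache Spark state store documentation and availability depends on your Spark version; the values below are placeholders, not the article's recommendations, so size them for your own workload.

```python
# Cap total RocksDB memory across all state store instances on an executor
# (requires a Spark version with bounded memory usage support).
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.boundedMemoryUsage", "true")
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.maxMemoryUsageMB", "2048")

# Per-instance write buffer (memtable) sizing.
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.writeBufferSizeMB", "64")
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.maxWriteBufferNumber", "2")

# Skip maintaining exact row counts to reduce per-batch overhead on large state.
spark.conf.set("spark.sql.streaming.stateStore.rocksdb.trackTotalNumberOfRows", "false")
```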