Keras model training with HorovodRunner works until the training function exits ("The MPI_Query_thread() function was called after MPI_FINALIZE was invoked.")
I am training a Keras/TensorFlow deep learning model on a cluster of (for now) 2 workers and 1 driver (T4 GPU, 28 GB, 4 cores) using the Databricks-provided HorovodRunner. It all seems to go well and the performance scales quite nicely over ...
Latest Reply
I suspect the callbacks are the cause. Can you remove all of the state callbacks and check whether the error still occurs?
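One way to try this without rewriting the training function is to filter the callback list before it is passed to `model.fit`. The sketch below is only an illustration: the original post's callback list is not shown here, so the helper name `split_state_callbacks`, the assumption that the "state callbacks" can be recognised by `State` in their class name, and the two stand-in classes are all hypothetical.

```python
from typing import Iterable, List, Tuple


def split_state_callbacks(
    callbacks: Iterable[object], marker: str = "State"
) -> Tuple[List[object], List[object]]:
    """Partition callbacks into (kept, removed) based on the class name.

    Assumption: a callback counts as a "state callback" when `marker`
    appears in its class name. Adjust the rule to match the real list.
    """
    kept, removed = [], []
    for cb in callbacks:
        if marker in type(cb).__name__:
            removed.append(cb)
        else:
            kept.append(cb)
    return kept, removed


# Hypothetical stand-ins so the sketch runs without Horovod or Keras installed.
# In the real training function these would be the actual callback objects.
class CommitStateCallback: ...
class BroadcastGlobalVariablesCallback: ...


all_callbacks = [BroadcastGlobalVariablesCallback(), CommitStateCallback()]
kept, removed = split_state_callbacks(all_callbacks)

# Report which callbacks were dropped for this test run.
print("removed:", [type(cb).__name__ for cb in removed])
# Inside the training function, pass only `kept`: model.fit(..., callbacks=kept)
```

If the MPI error disappears with only `kept` in place, the removed callbacks can be added back one at a time to find the one responsible.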