The same Scoring notebook is run 3 times as 3 Jobs. The jobs are identical: same PetaStorm code, same CPU cluster configuration (not a Spot cluster), and same data, yet the elapsed runtimes vary widely. Elapsed runtimes: Job 1: 3 hours, Job 2: 39 min, Job 3: 42 min. What would cause the large variation in Job 1?
The same Scoring code was then ported to run on a GPU cluster; an updated environment and libraries were required, and the data is the same. Runtimes: JobG 1 on a single-node GPU cluster took 67 min; JobG 2 on a multi-node (1-4 worker) cluster took 73 min. JobG 2 (multi-node) did not run faster than JobG 1 (single node). Looking at the logs, the nodes run serially rather than in parallel: node 1 runs and stops, then node 2 runs and stops, then node 3, then node 4. Is there a configuration parameter required for parallel node processing, or is another factor involved?
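For clarity, this is the kind of restructuring we are wondering about: letting Spark fan the scoring out across executors (e.g. with mapInPandas), so each worker node scores its own partitions in parallel instead of the nodes running one after another. This is only a minimal sketch under that assumption; the paths, the repartition count, and the dummy "model" below are illustrative, not our actual code.

```python
# Illustrative sketch: parallel batch scoring across Spark workers via mapInPandas.
# All paths and the dummy prediction logic are placeholders.
from typing import Iterator
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/path/to/scoring/data")      # placeholder input path
out_schema = df.schema.add("prediction", "double")     # output rows gain a prediction column

def score_partition(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # In real code the model would be loaded once per task here, then reused per batch.
    for pdf in batches:
        # Dummy "prediction" (sum of numeric columns) standing in for model.predict(...)
        pdf["prediction"] = pdf.select_dtypes("number").sum(axis=1).astype("float64")
        yield pdf

# Repartitioning spreads the batches over the workers, so scoring runs in parallel.
scored = df.repartition(8).mapInPandas(score_partition, schema=out_schema)
scored.write.mode("overwrite").parquet("/path/to/scored/output")  # placeholder output path
```

Is something along these lines required, or should the existing PetaStorm-based code parallelize across the 4 nodes with only a cluster/configuration change?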
Could you please confirm whether you have set any parameters for selecting the best model? For example, does training stop after some number of epochs if there is no improvement in model performance?
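To make the question concrete, the following is the kind of early-stopping / best-model setup we are asking about. This is a Keras-style sketch only; we are not assuming this is the framework actually used, and the metric, patience value, and checkpoint path are illustrative.

```python
# Illustrative early-stopping configuration (Keras); values are placeholders.
import tensorflow as tf

callbacks = [
    # Stop training if the monitored validation metric has not improved for `patience` epochs.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
    # Keep a checkpoint of the best model seen so far.
    tf.keras.callbacks.ModelCheckpoint("/path/to/best_model", monitor="val_loss",
                                       save_best_only=True),
]

# Usage: model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=callbacks)
```

Were parameters like these (monitored metric, patience, best-model checkpointing) set when the best model was trained?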