Hey Guys,
I have a few questions that I hope you can help me with.
I've started training a PyTorch model with distributed training using Petastorm + Horovod, as Databricks suggests in its docs.
Q1:
I can see that each worker is training the model, but once the epochs are done I get an error and the job fails.
I'm not sure what's causing the error; here are some suspicious lines I found in the logs:
[1,0]:terminate called without an active exception
[1,0]:[***] *** Process received signal ***
[1,0]:[***] Signal: Aborted (6)
[1,0]:[***] Signal code: (-6)
[1,0]:/tmp/HorovodRunner_xx4/launch.sh: line 8: 843 Aborted PYTHONPATH=$path /local_disk0/.ephemeral_nfs/envs/pythonEnv-xx/bin/python -c "from pyspark import cloudpickle; cloudpickle.load(open('func.pkl', 'rb'))(rank=$rank)"
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[43658,1],0]
Exit code: 134
Q2:
What is the best practice for logging metrics, the model, params, and all the other wanted information?
When I tried mlflow.autolog() nothing was logged, and when I tried mlflow.pytorch.autolog() I got an error saying I don't have a PyTorch Lightning module (the model is a plain PyTorch model).
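In case manual logging is the way to go, this is roughly what I have in mind for a plain torch.nn.Module: log params, per-epoch metrics, and the model explicitly. The param/metric names and values below are just placeholders, not my real ones:

import mlflow
import mlflow.pytorch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the real model
with mlflow.start_run():
    mlflow.log_params({"lr": 1e-3, "batch_size": 64})  # placeholder params
    for epoch in range(3):
        train_loss = 0.1 * (3 - epoch)  # stand-in for the real epoch loss
        mlflow.log_metric("train_loss", train_loss, step=epoch)
    # mlflow.pytorch.log_model accepts a plain torch.nn.Module, no Lightning needed
    mlflow.pytorch.log_model(model, "model")

If this runs inside the Horovod training function, I assume it should only be done from rank 0 (hvd.rank() == 0) so each worker doesn't create its own run, but I'm not sure that's the recommended pattern.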
Q3:
As part of the transformations I'm using the imgaug lib. If I install it on the driver and import it inside the training function, will this module exist on the workers too?
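To make the question concrete, this is roughly the pattern I mean, with the import done inside the function that HorovodRunner pickles and ships to the workers (the augmentation here is just an example, not my real pipeline):

import numpy as np

def train_fn(rank=0):
    import imgaug.augmenters as iaa  # imported inside the shipped training function
    aug = iaa.Fliplr(0.5)            # example augmentation
    batch = np.zeros((4, 32, 32, 3), dtype=np.uint8)
    return aug(images=batch).shape   # does this import succeed on the workers?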
EDIT:
Q1: Although the error wasn't very specific, I'm pretty sure it appeared because I had an un-picklable object in the training function. After modifying it, no errors appeared.
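For anyone hitting the same thing, the change was along these lines; the open file handle below is just an illustration, my un-picklable object was different:

# Before (fails to pickle): the training function closes over an object created on the driver
# log_file = open("/tmp/train.log", "a")
# def train_fn(rank=0):
#     log_file.write(f"rank {rank} started\n")

# After: the object is created on the worker, inside the function, so nothing un-picklable
# gets captured when HorovodRunner/cloudpickle serializes train_fn
def train_fn(rank=0):
    with open("/tmp/train.log", "a") as log_file:
        log_file.write(f"rank {rank} started\n")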
As for the rest of the questions, I still haven't managed to solve them.
Any feedback will be appreciated.
Thanks!