
MLflow log pytorch distributed training

Valued Contributor II

Hey guys,

I have a few questions that I hope you can help me with.

I started training a PyTorch model with distributed training using Petastorm + Horovod, as Databricks suggests in the docs.

Q 1:

I can see that each worker is training the model, but when the epochs are done I get an error and the job fails.

I'm not sure what is causing the error. Some suspicious lines I found in the logs:

[1,0]:terminate called without an active exception
[1,0]:[***] *** Process received signal ***
[1,0]:[***] Signal: Aborted (6)
[1,0]:[***] Signal code:  (-6)
[1,0]:/tmp/HorovodRunner_xx4/ line 8:   843 Aborted                 PYTHONPATH=$path /local_disk0/.ephemeral_nfs/envs/pythonEnv-xx/bin/python -c "from pyspark import cloudpickle; cloudpickle.load(open('func.pkl', 'rb'))(rank=$rank)"
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[43658,1],0]
  Exit code:    134


Q 2:

What is the best practice for logging metrics, the model, params, and all the other wanted information?

When I tried mlflow.autolog(), nothing was logged, and when I tried mlflow.pytorch.autolog() I got an error saying that I don't have a PyTorch Lightning module (the model is a plain PyTorch model).


Q 3:

As part of the transformations I'm using the imgaug library. If I install it on the driver and import it in the training function, will this module exist on the workers too?


Q1: Although the error wasn't very specific, I'm pretty sure it appeared because I had an unpicklable object in the training function.

After the modification, no errors appeared.

About the rest of the questions: I still haven't managed to solve them.
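A quick way to narrow down this kind of problem is to try serializing each object the training function captures before handing the function to HorovodRunner. The sketch below is only an illustration: `find_unpicklable` and the `captured` dictionary are hypothetical names, and it assumes the standalone cloudpickle package is importable, falling back to the stricter standard-library pickle if it is not.

```python
import pickle
import threading

try:
    # Assumption: cloudpickle is installed (the log above shows it being used).
    import cloudpickle as pickler
except ImportError:
    # Fallback: the standard pickle module rejects more objects than cloudpickle.
    pickler = pickle


def find_unpicklable(objects):
    """Return the names of the objects that fail to serialize."""
    failed = []
    for name, obj in objects.items():
        try:
            pickler.dumps(obj)
        except Exception:
            failed.append(name)
    return failed


# Hypothetical objects referenced from inside a training function.
captured = {"learning_rate": 0.01, "lock": threading.Lock()}
print(find_unpicklable(captured))
```

Anything reported here is a candidate for being created inside the training function instead of being captured from the driver.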

Any feedback will be appreciated.



Not applicable

@orian hindi:

Regarding your questions:

Q1: The log shows the process being aborted rather than hitting a segmentation fault: "Signal: Aborted (6)" is SIGABRT, and exit code 134 is 128 + 6. The message "terminate called without an active exception" is printed by the C++ runtime when std::terminate is called, which commonly happens when a native thread is destroyed while still joinable, for example a background thread that is still running when the process exits. It could be caused by several factors, including a bug in your code or issues with your cluster's configuration. I recommend checking the resource allocation for your workers (memory, CPU, and GPU) to ensure that it is appropriate for your training workload. Also check that your PyTorch, Horovod, and Petastorm versions are compatible with each other. You may also want to look into using PyTorch's DistributedDataParallel (DDP) module instead of Horovod for distributed training.
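As a quick check on what kind of crash the log reports, the exit code can be mapped back to a signal. Shells conventionally report 128 + N when a process is killed by signal N. The helper name `decode_exit_code` below is hypothetical, and the mapping assumes a Linux worker.

```python
import signal


def decode_exit_code(code):
    """Return the signal behind a shell-style exit code (128 + N), or None."""
    if code <= 128:
        return None
    try:
        return signal.Signals(code - 128)
    except ValueError:
        # The offset does not correspond to a known signal on this platform.
        return None


sig = decode_exit_code(134)  # exit code taken from the log above
print(sig.name if sig is not None else "not a signal exit")
```

For the log in the question, 134 maps to signal 6, which matches the "Signal: Aborted (6)" line.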

Q2: mlflow.pytorch.autolog() only supports PyTorch Lightning models, so with a plain PyTorch model you need to log manually. You can log metrics and parameters to MLflow with the mlflow.log_metric() and mlflow.log_param() functions, calling them during training and evaluation. You can then use MLflow's tracking UI to view and compare the results of your training runs. To save the model itself, use the mlflow.pytorch.log_model() function after training.
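With Horovod, every worker runs the same training function, so a common pattern is to log from rank 0 only, to avoid duplicate entries. The following is a minimal sketch of that gating, not a complete integration: the helper names are hypothetical, and `tracker` is assumed to be the mlflow module (or anything with the same `log_params` and `log_metric` methods), passed in so the helpers can be tried without a tracking server.

```python
def log_params_once(tracker, rank, params):
    """Log hyperparameters from rank 0 only; return True if logged."""
    if rank != 0:
        return False
    tracker.log_params(params)
    return True


def log_epoch_metrics(tracker, rank, epoch, metrics):
    """Log one epoch's metrics from rank 0 only; return True if logged."""
    if rank != 0:
        return False
    for name, value in metrics.items():
        tracker.log_metric(name, float(value), step=epoch)
    return True


# Usage sketch inside a Horovod training function. Assumes `import mlflow`,
# `import horovod.torch as hvd`, hvd.init() already called, and an active
# MLflow run on rank 0:
#
#   rank = hvd.rank()
#   log_params_once(mlflow, rank, {"lr": 0.01, "batch_size": 32})
#   for epoch in range(num_epochs):
#       ...  # train and compute `loss`
#       log_epoch_metrics(mlflow, rank, epoch, {"train_loss": loss.item()})
#   if rank == 0:
#       mlflow.pytorch.log_model(model, "model")
```

How the worker process authenticates against the tracking server and attaches to a run is environment-specific and is not covered by this sketch.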

Q3: If you installed imgaug on the driver node, it will not automatically be available on the worker nodes. You will need to ensure that the imgaug package is installed on all nodes in your cluster that are running your training code. One way to do this is to include imgaug in your environment setup script or use a container that has imgaug installed.
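One way to find out whether a package is visible on a given node is to run a small check at the top of the training function, so that each worker reports missing modules before training starts. This is a sketch using only the standard library; `missing_modules` is a hypothetical helper name.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on this node."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Called inside the training function, this runs on each worker.
missing = missing_modules(["imgaug"])
if missing:
    print("Not installed on this node:", ", ".join(missing))
```

Raising an error instead of printing would make the job fail fast with a clearer message than a later ImportError inside the data pipeline.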
