
MLflow log pytorch distributed training

Orianh
Valued Contributor II

Hey guys,

I have a few questions that I hope you can help me with.

I started training a PyTorch model with distributed training using Petastorm + Horovod, as Databricks suggests in the docs.

Q1:

I can see that each worker is training the model, but once the epochs are done I get an error and the job fails.

I'm not sure what is causing the error; some suspicious lines I found in the logs:

[1,0]:terminate called without an active exception
[1,0]:[***] *** Process received signal ***
[1,0]:[***] Signal: Aborted (6)
[1,0]:[***] Signal code:  (-6)
 
[1,0]:/tmp/HorovodRunner_xx4/launch.sh: line 8:   843 Aborted                 PYTHONPATH=$path /local_disk0/.ephemeral_nfs/envs/pythonEnv-xx/bin/python -c "from pyspark import cloudpickle; cloudpickle.load(open('func.pkl', 'rb'))(rank=$rank)"
 
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
 
  Process name: [[43658,1],0]
  Exit code:    134

Q2:

What is the best practice for logging metrics, the model, params, and any other wanted information?

When I tried mlflow.autolog(), nothing was logged, and when I tried mlflow.pytorch.autolog() I got an error that I don't have a PyTorch Lightning module (the model is a plain PyTorch model).

Q3:

As part of my transformations I'm using the imgaug lib. If I install it on the driver and import it in the training function, will the module exist on the workers too?

EDIT:

Q1: Although the error wasn't very specific, I'm pretty sure it appeared because I had an unpicklable object in the training function.

After modifying it, no errors appeared.
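Roughly the pattern that fixed it (names and values here are just illustrative): HorovodRunner pickles the training function with cloudpickle before shipping it to the workers, so anything the function closes over from the driver has to be picklable. Building the problematic objects inside the function avoids that.

import horovod.torch as hvd
from sparkdl import HorovodRunner

def train_fn():
    hvd.init()
    # Build non-picklable objects (augmenters, open file handles, loggers, ...)
    # here, inside the function, so each worker creates them fresh instead of
    # having them pickled from the driver's scope.
    return hvd.rank()

hr = HorovodRunner(np=2)  # np = number of worker processes
hr.run(train_fn)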

About the rest of the questions - I still didn't manage to solve them.

Any feedback will be appreciated.

Thanks!

1 REPLY

Anonymous
Not applicable

@orian hindi:

Regarding your questions:

Q1: The error you are seeing shows a worker process being aborted (signal 6, exit code 134) rather than exiting cleanly; "terminate called without an active exception" usually points to a native-level crash or an unclean shutdown inside one of the libraries. It could be caused by several factors, including a bug in your code or issues with your cluster's configuration. I recommend checking the resource allocation for your workers (memory, CPU, and GPU) to ensure it is appropriate for your training workload. Also check that the versions of PyTorch, Horovod, and Petastorm are compatible with each other. You may also want to look into using PyTorch's DistributedDataParallel (DDP) module instead of Horovod for distributed training.
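If you do experiment with DDP, the bare-bones setup looks roughly like this. It is only a sketch: the process launcher, the rank/world-size wiring, and the actual training loop are left out and assumed here.

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    # Assumes the launcher has set MASTER_ADDR / MASTER_PORT in the environment.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    model = nn.Linear(10, 1)  # placeholder model
    if torch.cuda.is_available():
        torch.cuda.set_device(rank)
        ddp_model = DDP(model.cuda(rank), device_ids=[rank])
    else:
        ddp_model = DDP(model)

    # ... build the optimizer and run the usual training loop on ddp_model ...

    dist.destroy_process_group()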

Q2: If you are using plain PyTorch, you can manually log metrics and parameters to MLflow with the mlflow.log_metric() and mlflow.log_param() functions. You can call these during training and evaluation to log the relevant values, and then use MLflow's tracking UI to view the results of your training runs. For PyTorch models, you can use the mlflow.pytorch.log_model() function to log your model after training.
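A minimal sketch of that manual-logging pattern for a plain PyTorch model (the toy model, data, and hyperparameter values below are placeholders; when running under HorovodRunner you would typically wrap these calls in an if hvd.rank() == 0: check so that only one worker writes to the MLflow run):

import mlflow
import mlflow.pytorch
import torch
import torch.nn as nn

# Toy stand-ins: swap in your own model, data, and hyperparameters.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

with mlflow.start_run():
    # Parameters are logged once per run
    mlflow.log_param("lr", 1e-3)
    mlflow.log_param("epochs", 5)

    for epoch in range(5):
        x = torch.randn(32, 10)
        y = torch.randn(32, 1)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        # Metrics can be logged per epoch (or per batch) with a step index
        mlflow.log_metric("train_loss", loss.item(), step=epoch)

    # Log the plain-PyTorch model; no Lightning module is required for this
    mlflow.pytorch.log_model(model, "model")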

Q3: If you installed imgaug on the driver node, it will not automatically be available on the worker nodes. You will need to ensure that the imgaug package is installed on all nodes in your cluster that are running your training code. One way to do this is to include imgaug in your environment setup script or use a container that has imgaug installed.
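For example, on Databricks you could install it as a cluster-scoped library through the cluster UI, or do a notebook-scoped install at the top of the notebook. Whether a notebook-scoped install reaches the HorovodRunner worker processes depends on your runtime version, so the cluster-level library is the safer assumption.

# In a notebook cell, before anything imports imgaug:
%pip install imgaug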
