Your code is loading the full model onto a single GPU, so having multiple GPUs does not by itself prevent out-of-memory errors. By default, transformers only gives you DDP (DistributedDataParallel), where each GPU holds a full copy of the model; that speeds up training but does not reduce the memory needed per GPU.
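
If you want the weights split across your GPUs instead of copied onto each one, you can let transformers/accelerate shard the model at load time with `device_map="auto"`. Here is a minimal sketch, assuming you have `accelerate` installed and that the checkpoint name is just a placeholder for whatever model you are actually loading:

```python
# Requires: pip install transformers accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "your-org/your-model"  # placeholder: replace with your own model

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# device_map="auto" spreads the layers across all visible GPUs (and CPU, if
# needed) instead of putting a full copy on each GPU, so a model that does
# not fit on one card can still be loaded.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    torch_dtype=torch.float16,  # half precision further cuts memory per GPU
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note this is model parallelism for inference/fitting the model, not a drop-in replacement for DDP training; for training a model that doesn't fit on one GPU you'd look at something like FSDP or DeepSpeed instead.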