Deep Learning with MATLAB on Multiple GPUs
MATLAB® supports training a single deep neural network using multiple GPUs in parallel. By using parallel workers with GPUs, you can train with multiple GPUs on your local machine, on a cluster, or in the cloud. Using multiple GPUs can speed up training significantly. To decide whether multi-GPU training is likely to deliver a performance gain, consider the following factors:
How long is the iteration on each GPU? If each GPU iteration is short, then the added overhead of communication between GPUs can dominate. Try increasing the computation per iteration by using a larger batch size.
Are all the GPUs on a single machine? Communication between GPUs on different machines introduces a significant communication delay. You can mitigate this if you have suitable hardware. For more information, see Advanced Support for Fast Multi-Node GPU Communication.
Tip
To train a single network using multiple GPUs on your local machine, you can simply specify the ExecutionEnvironment option as "multi-gpu" without changing the rest of your code. trainNetwork automatically uses your available GPUs for training computations.
When you train on a remote cluster, specify the ExecutionEnvironment option as "parallel". If the cluster has access to one or more GPUs, then trainNetwork only uses the GPUs for training. Workers without a unique GPU are never used for training computation.
If you want to use more resources, you can scale up deep learning training to clusters or the cloud. To learn more about parallel options, see Scale Up Deep Learning in Parallel, on GPUs, and in the Cloud. To try an example, see Train Network in the Cloud Using Automatic Parallel Support.
Using a GPU or parallel options requires Parallel Computing Toolbox™. Using a GPU also requires a supported GPU device. For information on supported devices, see GPU Computing Requirements (Parallel Computing Toolbox). Using a remote cluster also requires MATLAB Parallel Server™.
Use Multiple GPUs in Local Machine
Note
If you run MATLAB on a single machine in the cloud that you connect to via SSH or remote desktop protocol (RDP), then network execution and training use the same code as if you were running on your local machine.
If you have access to a machine with multiple GPUs, you can simply specify the ExecutionEnvironment option as "multi-gpu":
For training using trainNetwork, use the trainingOptions function to set the ExecutionEnvironment name-value option to "multi-gpu", as shown in the example after this list.
For inference using classify and predict, set the ExecutionEnvironment name-value option to "multi-gpu".
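For example, the following sketch trains and classifies using all local GPUs. Here imdsTrain, imdsTest, and layers are placeholders for your own datastores and network layers; adapt the solver and options to your application.
% Sketch: train on all local GPUs, then run inference the same way.
% imdsTrain, imdsTest, and layers are placeholders for your own data and network.
options = trainingOptions("sgdm", ...
    "MiniBatchSize",256, ...                  % consider scaling with the number of GPUs
    "ExecutionEnvironment","multi-gpu");
net = trainNetwork(imdsTrain,layers,options);
YPred = classify(net,imdsTest,"ExecutionEnvironment","multi-gpu");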
The "multi-gpu"
option allows you to use multiple GPUs in a
local parallel pool. If there is no current parallel pool,
trainNetwork
, predict
, and
classify
automatically start a local parallel pool using
your default cluster profile settings. The pool has as many workers as the number of
available GPUs.
For information on how to perform custom training using multiple GPUs in your local machine, see Run Custom Training Loops on a GPU and in Parallel.
Use Multiple GPUs in Cluster
For training and inference with multiple GPUs in a remote cluster, use the "parallel" option:
For training using trainNetwork, use the trainingOptions function to set the ExecutionEnvironment name-value option to "parallel", as shown in the example after this list.
For inference using classify and predict, set the ExecutionEnvironment name-value option to "parallel".
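For example, a minimal sketch, assuming you have a cluster profile named "MyCluster" configured for your remote cluster and that imdsTrain and layers already exist:
% Sketch: open a pool on the remote cluster, then train on its GPU workers.
% "MyCluster", imdsTrain, and layers are placeholders.
parpool("MyCluster");
options = trainingOptions("sgdm","ExecutionEnvironment","parallel");
net = trainNetwork(imdsTrain,layers,options);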
If there is no current parallel pool, trainNetwork, predict, and classify automatically start a parallel pool using your default cluster profile settings. If the pool has access to GPUs, then only workers with a unique GPU perform training computation. If the pool does not have GPUs, then training takes place on all available CPU workers instead.
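If you are unsure what your pool workers can see, a quick diagnostic sketch such as the following counts the GPUs available to each worker in the current pool:
% Sketch: report the number of available GPUs on each pool worker.
spmd
    workerGPUs = gpuDeviceCount("available");
end
workerGPUs{:}    % display the count returned by each worker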
For information on how to perform custom training using multiple GPUs in a remote cluster, see Run Custom Training Loops on a GPU and in Parallel.
Optimize Mini-Batch Size and Learning Rate
Convolutional neural networks are typically trained iteratively using mini-batches of images. This is because the whole dataset is usually too large to fit into GPU memory. For optimum performance, you can experiment with the mini-batch size by changing the MiniBatchSize name-value option using the trainingOptions function.
The optimal mini-batch size depends on your exact network, dataset, and GPU hardware. When training with multiple GPUs, each image batch is distributed between the GPUs. This effectively increases the total GPU memory available, allowing larger batch sizes. A recommended practice is to scale up the mini-batch size linearly with the number of GPUs, in order to keep the workload on each GPU constant. For example, if you are training on a single GPU using a mini-batch size of 64, and you want to scale up to training with four GPUs of the same type, you can increase the mini-batch size to 256 so that each GPU processes 64 observations per iteration.
Because increasing the mini-batch size improves the significance of each iteration, you can increase the learning rate. A good general guideline is to increase the learning rate proportionally to the increase in mini-batch size. Depending on your application, a larger mini-batch size and learning rate can speed up training without a decrease in accuracy, up to some limit.
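For example, the following sketch scales both settings linearly with the number of available GPUs. The single-GPU baseline values of 64 and 0.001 are placeholders; start from the values that work for your network.
% Sketch: scale mini-batch size and learning rate with the number of GPUs.
% The baseline values (64 and 0.001) are placeholders for your single-GPU settings.
numGPUs = gpuDeviceCount("available");
options = trainingOptions("sgdm", ...
    "MiniBatchSize",64*numGPUs, ...          % keep the per-GPU workload constant
    "InitialLearnRate",0.001*numGPUs, ...    % scale the learning rate proportionally
    "ExecutionEnvironment","multi-gpu");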
Select Particular GPUs to Use for Training
If you do not want to use all of your GPUs, you can select the GPUs that you want to use for training and inference directly. Doing so can be useful to avoid training on a poor-performance GPU, for example, your display GPU.
If your GPUs are in your local machine, you can use the gpuDeviceTable (Parallel Computing Toolbox) and gpuDeviceCount (Parallel Computing Toolbox) functions to examine your GPU resources and determine the indices of the GPUs you want to use.
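For example, you can list the detected devices and count the ones available for computation:
% List the detected GPU devices and count those available for computation.
gpuDeviceTable
numAvailableGPUs = gpuDeviceCount("available")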
For single GPU training with the "auto" or "gpu" options, by default, MATLAB uses the GPU device with index 1. You can use a different GPU by selecting the device before you start training. Use gpuDevice (Parallel Computing Toolbox) to select the desired GPU using its index:
gpuDevice(index)
trainNetwork, predict, and classify automatically use the selected GPU when you set the ExecutionEnvironment option to "auto" or "gpu".
For multiple GPU training with the "multi-gpu" option, by default, MATLAB uses all available GPUs in your local machine. If you want to exclude GPUs, you can start the parallel pool in advance and select the devices manually.
For example, suppose you have three GPUs but you only want to use the devices with indices 1 and 3. You can use the following code to start a parallel pool with two workers and select one GPU on each worker.
useGPUs = [1 3];
parpool("Processes", numel(useGPUs));
spmd
    gpuDevice(useGPUs(spmdIndex));
end
trainNetwork, predict, and classify automatically use the current parallel pool when you set the ExecutionEnvironment option to "multi-gpu" (or "parallel" for the same result).
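For example, continuing the sketch above, training then runs only on the two selected GPUs (imdsTrain and layers are placeholders for your own data and network):
% Sketch: train using the GPUs selected on the workers of the existing pool.
options = trainingOptions("sgdm","ExecutionEnvironment","multi-gpu");
net = trainNetwork(imdsTrain,layers,options);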
Another option is to select workers using the WorkerLoad name-value argument in trainingOptions. For example:
parpool("Processes", 5);
opts = trainingOptions('sgdm', 'WorkerLoad', [1 1 1 0 1], ...)
In this case, the fourth worker is part of the pool but idle, which is not an ideal use of the parallel resources. It is often more efficient to select the GPUs for training manually using gpuDevice.
Train Multiple Networks on Multiple GPUs
If you want to train multiple models in parallel with one GPU each, start a parallel pool with one worker per available GPU, and train each network on a different worker. Use parfor or parfeval to simultaneously execute a network on each worker. Use the trainingOptions function to set the ExecutionEnvironment name-value option to "gpu" on each worker.
For example, use code of the following form to train multiple networks in parallel on all available GPUs:
options = trainingOptions("sgdm","ExecutionEnvironment","gpu");
parfor i = 1:gpuDeviceCount("available")
    trainNetwork(…,options);
end
To run in the background without blocking your local MATLAB, use parfeval. For examples showing how to train multiple networks using parfor and parfeval, see Use parfor to Train Multiple Deep Learning Networks and Use parfeval to Train Multiple Deep Learning Networks.
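As a minimal parfeval sketch, assuming imdsTrain and layers exist and each pool worker has its own GPU:
% Sketch: train several networks in the background, one per GPU worker.
% imdsTrain and layers are placeholders for your own data and network.
options = trainingOptions("sgdm","ExecutionEnvironment","gpu");
numNetworks = gpuDeviceCount("available");
for i = 1:numNetworks
    f(i) = parfeval(@trainNetwork,1,imdsTrain,layers,options);
end
nets = cell(1,numNetworks);
for i = 1:numNetworks
    [idx,net] = fetchNext(f);    % collect each network as its future finishes
    nets{idx} = net;
end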
Advanced Support for Fast Multi-Node GPU Communication
Some multi-GPU features in MATLAB, including trainNetwork, are optimized for direct communication via fast interconnects for improved performance.
If you have appropriate hardware connections, then data transfer between multiple GPUs uses fast peer-to-peer communication, including NVLink, if available.
If you are using a Linux compute cluster with fast interconnects between machines, such as Infiniband, or fast interconnects between GPUs on different machines, such as GPUDirect RDMA, you might be able to take advantage of fast multi-node support in MATLAB. Enable this support on all the workers in your pool by setting the environment variable PARALLEL_SERVER_FAST_MULTINODE_GPU_COMMUNICATION to 1. Set this environment variable in the Cluster Profile Manager.
This feature is part of the NVIDIA NCCL library for GPU communication. To configure it, you might need to set additional environment variables to define the network interface protocol, especially NCCL_SOCKET_IFNAME. For more information, see the NCCL documentation, in particular the section on NCCL Environment Variables.
See Also
trainNetwork | trainingOptions | gpuDevice (Parallel Computing Toolbox) | spmd (Parallel Computing Toolbox) | imageDatastore