Deep Learning with MATLAB on Multiple GPUs
MATLAB® supports training a single deep neural network using multiple GPUs in parallel. By using parallel workers with GPUs, you can train with multiple GPUs on your local machine, on a cluster, or on the cloud. Using multiple GPUs can speed up training significantly. To decide whether multi-GPU training is likely to deliver a performance gain, consider the following factors:
- How long is the iteration on each GPU? If each GPU iteration is short, then the added overhead of communication between GPUs can dominate. Try increasing the computation per iteration by using a larger batch size. 
- Are all the GPUs on a single machine? Communication between GPUs on different machines introduces a significant communication delay. You can mitigate this if you have suitable hardware. For more information, see Advanced Support for Fast Multi-Node GPU Communication. 
Tip
To train a single network using multiple GPUs on your local machine, you can simply specify the ExecutionEnvironment option as "multi-gpu" without changing the rest of your code. The trainnet function automatically uses your available GPUs for training computations. For an example showing how to train a network using multiple local GPUs, see Train Network Using Automatic Multi-GPU Support.
When you train on a remote cluster, specify the ExecutionEnvironment option as "parallel-auto". If the cluster has access to one or more GPUs, then trainnet uses only the GPUs for training. Workers without a unique GPU are never used for training computation.
If you want to use more resources, you can scale up deep learning training to clusters or the cloud. To learn more about parallel options, see Scale Up Deep Learning in Parallel, on GPUs, and in the Cloud. To try an example, see Train Network in the Cloud Using Automatic Parallel Support.
Using a GPU or parallel options requires Parallel Computing Toolbox™. Using a GPU also requires a supported GPU device. For information on supported devices, see GPU Computing Requirements (Parallel Computing Toolbox). Using a remote cluster also requires MATLAB Parallel Server™.
Use Multiple GPUs in Local Machine
Note
If you run MATLAB on a single machine in the cloud that you connect to via ssh or remote desktop protocol (RDP), then network execution and training use the same code as if you were running on your local machine.
If you have access to a machine with multiple GPUs, you can train a network using the trainnet function by setting the ExecutionEnvironment training option to "multi-gpu" using the trainingOptions function.
The "multi-gpu" option allows you to use multiple GPUs in a local parallel pool. If there is no current parallel pool, trainnet automatically starts a local parallel pool using your default cluster profile settings. The pool has as many workers as the number of available GPUs.
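As a minimal sketch of the option described above, the following code creates training options for local multi-GPU training. The solver, mini-batch size, and learning rate are assumed values chosen only for illustration, and the commented trainnet call uses placeholder variables (XTrain, TTrain, layers) that are not defined here.

```matlab
% Training options for multi-GPU training on the local machine.
% The solver and hyperparameter values are assumptions for illustration.
options = trainingOptions("sgdm", ...
    ExecutionEnvironment="multi-gpu", ...
    MiniBatchSize=256, ...
    InitialLearnRate=0.01);

% Placeholder training call. XTrain, TTrain, and layers are hypothetical
% variables that depend on your data and network.
% net = trainnet(XTrain,TTrain,layers,"crossentropy",options);
```

The rest of the training code stays the same as for single-GPU training. Only the ExecutionEnvironment value changes.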
For information on how to perform custom training using multiple GPUs in your local machine, see Run Custom Training Loops on a GPU and in Parallel.
Use Multiple GPUs in Cluster
For training with multiple GPUs in a remote cluster, set the ExecutionEnvironment training option to "parallel-auto" or "parallel-gpu" using the trainingOptions function.
If there is no current parallel pool, trainnet automatically starts a parallel pool using your default cluster profile settings. If the pool has access to GPUs, then only workers with a unique GPU perform training computation. If the pool does not have GPUs, then training takes place on all available CPU workers instead.
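If you prefer to control the pool yourself rather than rely on the default cluster profile, you can start the pool before training. The following is a sketch under stated assumptions: "myClusterProfile" is a hypothetical profile name and the pool size of 8 is arbitrary. Substitute the values for your own cluster.

```matlab
% "myClusterProfile" is a hypothetical cluster profile name.
cluster = parcluster("myClusterProfile");

% Start a pool on the cluster. The worker count of 8 is an assumption.
pool = parpool(cluster,8);

% trainnet uses the current pool when ExecutionEnvironment is "parallel-auto".
options = trainingOptions("sgdm",ExecutionEnvironment="parallel-auto");
```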
For information on how to perform custom training using multiple GPUs in a remote cluster, see Run Custom Training Loops on a GPU and in Parallel.
Optimize Mini-Batch Size and Learning Rate
Convolutional neural networks are typically trained iteratively using mini-batches of images. This is because the whole data set is usually too large to fit into GPU memory. For optimum performance, you can experiment with the mini-batch size by changing the MiniBatchSize option using the trainingOptions function.
The optimal mini-batch size depends on your exact network, data set, and GPU hardware. When training with multiple GPUs, each image batch is distributed between the GPUs. This effectively increases the total GPU memory available, allowing larger batch sizes. A recommended practice is to scale up the mini-batch size linearly with the number of GPUs, in order to keep the workload on each GPU constant. For example, if you are training on a single GPU using a mini-batch size of 64, and you want to scale up to training with four GPUs of the same type, you can increase the mini-batch size to 256 so that each GPU processes 64 observations per iteration.
Because increasing the mini-batch size improves the significance of each iteration, you can increase the learning rate. A good general guideline is to increase the learning rate proportionally to the increase in mini-batch size. Depending on your application, a larger mini-batch size and learning rate can speed up training without a decrease in accuracy, up to some limit.
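The linear scaling guideline can be sketched as follows, using the example of moving from one GPU with a mini-batch size of 64 to four GPUs. The base learning rate of 0.01 is an assumed value, not a recommendation. Treat the scaled values as a starting point for experimentation.

```matlab
% Single-GPU baseline. The base learning rate is an assumed value.
baseMiniBatchSize = 64;
baseLearnRate = 0.01;
numGPUs = 4;

% Scale the mini-batch size linearly with the number of GPUs so that
% each GPU still processes 64 observations per iteration.
miniBatchSize = baseMiniBatchSize*numGPUs;

% Scale the learning rate in proportion to the mini-batch size increase.
learnRate = baseLearnRate*(miniBatchSize/baseMiniBatchSize);

options = trainingOptions("sgdm", ...
    ExecutionEnvironment="multi-gpu", ...
    MiniBatchSize=miniBatchSize, ...
    InitialLearnRate=learnRate);
```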
You can use the Experiment Manager app to find optimal training options by sweeping through a range of hyperparameter values or by using Bayesian optimization. For more information on how to use Experiment Manager, see Create a Deep Learning Experiment for Classification.
Select Particular GPUs to Use for Training
If you do not want to use all of your GPUs, you can select the GPUs that you want to use for training and inference directly. Doing so can be useful to avoid training on a poor-performance GPU, for example, your display GPU.
If your GPUs are in your local machine, you can use the gpuDeviceTable (Parallel Computing Toolbox) and gpuDeviceCount (Parallel Computing Toolbox) functions to examine your GPU resources and determine the index of the GPUs you want to use.
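For example, a short check like the following lists the GPUs that MATLAB can see. This is a minimal sketch. The table is displayed only when at least one GPU is available.

```matlab
% Count the GPUs that are available for use in this MATLAB session.
numGPUs = gpuDeviceCount("available");
fprintf("Available GPUs: %d\n",numGPUs);

% Display a table of GPU devices, including the index of each device.
if numGPUs > 0
    disp(gpuDeviceTable)
end
```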
For single GPU training with the "auto" or "gpu" options, by default, MATLAB uses the GPU device with index 1. You can use a different GPU by selecting the device before you start training. Use gpuDevice (Parallel Computing Toolbox) to select the desired GPU using its index:
gpuDevice(index)
trainnet automatically uses the selected GPU when you set the ExecutionEnvironment option to "auto" or "gpu".
For multiple GPU training with the "multi-gpu" option, by default, MATLAB uses all available GPUs in your local machine. If you want to exclude GPUs, you can start the parallel pool in advance and select the devices manually.
For example, suppose you have three GPUs but you only want to use the devices with indices 1 and 3. You can use the following code to start a parallel pool with two workers and select one GPU on each worker.
useGPUs = [1 3];
parpool("Processes",numel(useGPUs));
spmd
    gpuDevice(useGPUs(spmdIndex));
end
trainnet automatically uses the current parallel pool when you set the ExecutionEnvironment option to "multi-gpu" (or "parallel-auto" or "parallel-gpu" for the same result).
Train Multiple Networks on Multiple GPUs
To train multiple models in parallel with one GPU each, start a parallel pool with one worker per available GPU, and train each network on a different worker. Use parfor or parfeval to train a network on each worker simultaneously. Use the trainingOptions function to set the ExecutionEnvironment name-value option to "gpu" on each worker.
For example, use code of the following form to train multiple networks in parallel on all available GPUs:
options = trainingOptions("sgdm",ExecutionEnvironment="gpu");
parfor i=1:gpuDeviceCount("available")
    trainnet(…,options);
end
To run in the background without blocking your local MATLAB, use parfeval. For examples showing how to train
                multiple networks using parfor and
                    parfeval, see
Make Predictions Using Multiple GPUs
To make predictions in parallel using multiple GPUs, create a parallel pool with one worker per GPU, divide up your data, and make the predictions in parallel. For an example showing how to make predictions using multiple GPUs, see Train Network Using Automatic Multi-GPU Support.
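The steps above can be sketched as follows. This is an illustration under stated assumptions, not the method used in the linked example: net is assumed to be a trained dlnetwork, X is assumed to be an h-by-w-by-c-by-N numeric array of images, and minibatchpredict is used as the prediction function. For simplicity, the whole array X is sent to every worker. For large data sets, consider having each worker read only its own portion.

```matlab
% Assumptions: net is a trained dlnetwork and X is an
% h-by-w-by-c-by-N numeric array of images on the client.
numGPUs = gpuDeviceCount("available");
pool = parpool("Processes",numGPUs);   % one worker per GPU

numObservations = size(X,4);
spmd
    % Each worker takes an interleaved subset of the observations.
    idxLocal = spmdIndex:spmdSize:numObservations;

    % Make predictions for this worker's subset.
    scoresLocal = minibatchpredict(net,X(:,:,:,idxLocal));
end

% scoresLocal and idxLocal are Composite objects on the client.
% Collect the per-worker results into cell arrays.
scores = cell(1,numel(scoresLocal));
indices = cell(1,numel(idxLocal));
for k = 1:numel(scoresLocal)
    scores{k} = scoresLocal{k};
    indices{k} = idxLocal{k};
end
```

The indices cell array records which observations each worker processed, so you can put the predictions back in the original order.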
Advanced Support for Fast Multi-Node GPU Communication
Some multi-GPU features in MATLAB, including the trainnet function, are optimized for direct communication via fast interconnects for improved performance.
If you have appropriate hardware connections, then data transfer between multiple GPUs uses fast peer-to-peer communication, including NVLink, if available.
If you are using a Linux® compute cluster with fast interconnects between machines, such as InfiniBand, or fast interconnects between GPUs on different machines, such as GPUDirect RDMA, you might be able to take advantage of fast multi-node support in MATLAB. Enable this support on all the workers in your pool by setting the environment variable PARALLEL_SERVER_FAST_MULTINODE_GPU_COMMUNICATION to 1. Set this environment variable in the Cluster Profile Manager.
This feature is part of the NVIDIA NCCL library for GPU communication. To configure it, you must set additional environment variables to define the network interface protocol, especially NCCL_SOCKET_IFNAME. For more information, see the NCCL documentation and in particular the section on NCCL Environment Variables.
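After you set the environment variables in the cluster profile, you might want to confirm that the workers can see them. The following sketch prints the value seen by each worker in the current pool. It only checks that the variables are visible and does not confirm that fast multi-node communication is actually in use.

```matlab
% Print the values of the environment variables on each worker.
% An empty value means the variable is not set on that worker.
spmd
    fastValue = getenv('PARALLEL_SERVER_FAST_MULTINODE_GPU_COMMUNICATION');
    ifname = getenv('NCCL_SOCKET_IFNAME');
    fprintf("Worker %d: fast multi-node = '%s', NCCL_SOCKET_IFNAME = '%s'\n", ...
        spmdIndex,fastValue,ifname);
end
```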
See Also
trainnet | trainingOptions | dlnetwork | gpuDevice (Parallel Computing Toolbox) | spmd (Parallel Computing Toolbox) | imageDatastore