Analyze and Model Data on GPU

You can often improve code performance by executing it on a graphics processing unit (GPU). For example, execution on a GPU can improve performance if:

  • Your code is computationally expensive, so that computing time significantly exceeds the time spent transferring data to and from GPU memory.

  • Your workflow uses functions with gpuArray (Parallel Computing Toolbox) support with large array inputs.

When writing code for the GPU, it is best to start with code that already performs well on the CPU. Vectorization is usually critical for achieving high performance on the GPU. Convert code to use functions that support GPU array arguments and transfer the input data to the GPU. For more information about MATLAB functions with GPU array inputs, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
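As a minimal sketch of this advice, the following code compares a loop with its vectorized equivalent and then runs the vectorized form on the GPU when one is available. The data and variable names are illustrative assumptions, and canUseGPU is used only to guard the transfer so that the code also runs without a GPU.

```matlab
% Illustrative data (hypothetical): 1000 observations of 50 variables.
rng(0)
x = randn(1000,50);

% Loop version: sum of squares computed one column at a time.
ssLoop = zeros(1,size(x,2));
for j = 1:size(x,2)
    ssLoop(j) = sum(x(:,j).^2);
end

% Vectorized version: a single call over the whole matrix.
ssVec = sum(x.^2,1);

% Transfer the input to the GPU only if a supported device is available.
if canUseGPU
    xGPU = gpuArray(x);
    ssGPU = sum(xGPU.^2,1);   % executes on the GPU and returns a gpuArray
    ssVec = gather(ssGPU);    % bring the result back for comparison
end

% The loop and vectorized results agree to within rounding error.
maxDiff = max(abs(ssLoop - ssVec));
```

The vectorized form is the one worth moving to the GPU, because it issues one large operation instead of many small ones.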

Many functions in Statistics and Machine Learning Toolbox™ automatically execute on the GPU when you use GPU array input data. For example, you can create a probability distribution object on the GPU, where the output is a GPU array.

pd = fitdist(gpuArray(x),"Normal")

Using a GPU requires Parallel Computing Toolbox™ and a supported GPU device. For information about supported devices, see GPU Support by Release (Parallel Computing Toolbox). For the complete list of Statistics and Machine Learning Toolbox™ functions that accept GPU arrays, see Functions.

Query Properties of GPU

You can query and select your GPU device using the gpuDevice function. If you have multiple GPUs, you can examine the properties of all GPUs detected in your system with the gpuDeviceTable function. Then, you can select a specific GPU for single-GPU execution by using its index (gpuDevice(index)).

D = gpuDevice
D = 
  CUDADevice with properties:

                      Name: 'Tesla V100-PCIE-32GB'
                     Index: 1
         ComputeCapability: '7.0'
            SupportsDouble: 1
             DriverVersion: 11.2000
            ToolkitVersion: 11
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 3.4090e+10
           AvailableMemory: 3.3374e+10
       MultiprocessorCount: 80
              ClockRateKHz: 1380000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 0
          CanMapHostMemory: 1
           DeviceSupported: 1
           DeviceAvailable: 1
            DeviceSelected: 1
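The following sketch shows one way to list the detected devices and select one by index. It assumes a release that provides gpuDeviceTable and the "available" option of gpuDeviceCount, and the choice of index 1 is illustrative. Note that selecting a device with gpuDevice(index) resets that device, which invalidates any GPU arrays already in the workspace.

```matlab
if canUseGPU
    nGPU = gpuDeviceCount("available");   % number of GPUs available for use
    tbl = gpuDeviceTable;                  % one row per detected GPU
    disp(tbl)

    % Select the device with index 1 (illustrative choice).
    % Caution: this resets the device and clears its memory.
    D = gpuDevice(1);
    fprintf("Selected %s (compute capability %s)\n",D.Name,D.ComputeCapability);
else
    nGPU = 0;
    disp("No supported GPU detected.")
end
```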

Execute Function on GPU

Explore a data distribution on the GPU using descriptive statistics.

Generate a data set of normally distributed random numbers on the GPU.

dist = randn(1e5,1e4,"gpuArray");

Determine whether dist is a GPU array.

TF = isgpuarray(dist)
TF = logical
   1

Execute a function with a GPU array input argument. For example, calculate the sample skewness for each column in dist. Since dist is a GPU array, the skewness function executes on the GPU and returns the result as a GPU array.

skew = skewness(dist);

Verify that the output skew is a GPU array.

TF = isgpuarray(skew)
TF = logical
   1

Evaluate Speedup of GPU Execution

Evaluate function execution time on the GPU and compare performance with execution on the CPU.

Comparing the execution time of code on the CPU and the GPU can help you select an execution environment. For example, if you want to compute descriptive statistics from sample data, consider both the execution time and the data transfer time when you evaluate overall performance. If a function has gpuArray support, computation on the GPU generally becomes faster relative to the CPU as the number of observations increases.

Measure the function run time in seconds by using the gputimeit (Parallel Computing Toolbox) function. gputimeit is preferable to timeit for functions that use the GPU because it ensures operation completion and compensates for overhead.

skew = @() skewness(dist);
t = gputimeit(skew)
t = 0.6270

Evaluate the performance difference between the GPU and CPU by independently measuring the CPU execution time. For this GPU, execution of this code is faster than execution on the CPU.
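One possible way to make this comparison is sketched below. It is an illustration rather than the measurement used for the results on this page: the array is deliberately smaller than dist so that the sketch runs quickly, and the variable names distCPU, distGPU, tCPU, and tGPU are assumptions. The CPU time is measured with timeit and the GPU time with gputimeit.

```matlab
rng(0)
distCPU = randn(1e4,1e3);                 % in-memory (CPU) data, smaller than dist

% Time the CPU execution with timeit.
tCPU = timeit(@() skewness(distCPU));

if canUseGPU
    % Time the same computation on the GPU with gputimeit.
    distGPU = gpuArray(distCPU);
    tGPU = gputimeit(@() skewness(distGPU));
    fprintf("CPU: %.4f s, GPU: %.4f s, speedup: %.1fx\n",tCPU,tGPU,tCPU/tGPU);
else
    fprintf("CPU: %.4f s (no GPU available for comparison)\n",tCPU);
end
```

For small arrays the GPU might not be faster, because launch and transfer overhead dominate. Repeat the measurement at the data size you actually expect to use.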

The performance of code on a GPU is heavily dependent on the GPU used. For additional information about measuring and improving GPU performance, see Measure and Improve GPU Performance (Parallel Computing Toolbox).

Single Precision on GPU

You can often improve the performance of your code by performing calculations in single precision instead of double precision, at the cost of reduced numerical precision.

Determine the execution time of the skewness function with an input argument of the dist data set in single precision.

dist_single = single(dist);
skew_single = @() skewness(dist_single);
t_single = gputimeit(skew_single)
t_single = 0.2206

For this GPU, execution of this code with single precision data is faster than execution with double precision data.

The performance improvement depends on the GPU card and its total number of cores. For more information about using single precision with the GPU, see Measure and Improve GPU Performance (Parallel Computing Toolbox).
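With the timings reported above, the single-precision run is roughly 0.6270/0.2206, or about 2.8 times, faster on this GPU. The following sketch illustrates the accuracy side of that trade-off by comparing skewness results in the two precisions. The data size and variable names are illustrative assumptions, and the code falls back to the CPU when no GPU is available.

```matlab
rng(0)
xDouble = randn(1e4,100);        % illustrative data in double precision
xSingle = single(xDouble);       % same data in single precision

if canUseGPU
    xDouble = gpuArray(xDouble);
    xSingle = gpuArray(xSingle);
end

% skewness returns a result in the same precision as its input.
sDouble = gather(skewness(xDouble));
sSingle = gather(skewness(xSingle));

% Largest absolute difference between the two results.
maxAbsDiff = max(abs(sDouble - double(sSingle)));

% Speedup implied by the timings reported on this page.
speedup = 0.6270/0.2206;
```

If the difference is negligible for your application, single precision is usually worth considering for GPU workloads.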

Dimensionality Reduction and Model Fitting on GPU

Implement dimensionality reduction and classification workflows on a GPU.

You can use functions such as pca and fitcensemble together to train a machine learning model efficiently.

  • The pca function performs principal component analysis (PCA), which reduces data dimensionality by replacing several correlated variables with a new set of variables that are linear combinations of the original variables.

  • The fitcensemble function fits many classification learners to form an ensemble model that can make better predictions than a single learner.

Both functions are computationally intensive and can be significantly accelerated using the GPU.

For an example, use the humanactivity data set. The data set contains 24,075 observations of five different physical human activities: sitting, standing, walking, running, and dancing. Each observation has 60 features extracted from acceleration data measured by smartphone accelerometer sensors. The data set contains the following variables:

  • actid — Response vector containing the activity IDs in integers: 1, 2, 3, 4, and 5 representing sitting, standing, walking, running, and dancing, respectively

  • actnames — Activity names corresponding to the integer activity IDs

  • feat — Feature matrix of 60 features for 24,075 observations

  • featlabels — Labels of the 60 features

load humanactivity

Use 90% of the observations to train a model that classifies the five types of human activities, and use 10% of the observations to validate the trained model. Use cvpartition to specify a 10% holdout for the test set.

Partition = cvpartition(actid,"Holdout",0.10);
trainingInds = training(Partition); % Indices for the training set
testInds = test(Partition); % Indices for the test set

Transfer the training and test data to the GPU.

XTrain = gpuArray(feat(trainingInds,:));
YTrain = gpuArray(actid(trainingInds));
XTest = gpuArray(feat(testInds,:));
YTest = gpuArray(actid(testInds));

Find the principal components for the training data set XTrain.

[coeff,score,~,~,explained,mu] = pca(XTrain);

Find the number of components required to explain more than 99% of the variability.

idx = find(cumsum(explained)>99,1);
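To see how this expression selects the number of components, consider a small worked example. The explained values here are hypothetical and are not taken from the humanactivity data.

```matlab
% Hypothetical percentages of variance explained by five components.
explained = [60; 25; 10; 4.5; 0.5];

% Running total of explained variance: [60; 85; 95; 99.5; 100].
cumExplained = cumsum(explained);

% Index of the first component at which the running total exceeds 99.
idx = find(cumExplained > 99,1);   % idx is 4
```

The second argument of find limits the result to the first match, so idx is the smallest number of components whose cumulative explained variance exceeds the threshold.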

Determine the principal component scores that represent XTrain in the principal component space.

XTrainPCA = score(:,1:idx);

Fit an ensemble of learners for classification.

template = templateTree("MaxNumSplits",20,"Reproducible",true);
classificationEnsemble = fitcensemble(XTrainPCA,YTrain, ...
    "Method","AdaBoostM2", ...
    "NumLearningCycles",30, ...
    "Learners",template, ...
    "LearnRate",0.1, ...
    "ClassNames",[1; 2; 3; 4; 5]);

To use the trained model with the test set, you need to transform the test data by applying the PCA transformation obtained from the training data set, that is, the estimated mean mu and the coefficients coeff.

XTestPCA = (XTest-mu)*coeff(:,1:idx);
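The following sketch checks why this projection is consistent with the training scores. It uses small random data on the CPU, and the names X, XNew, and scoreManual are illustrative assumptions. With the default pca options, centering the training data with mu and multiplying by coeff reproduces score, so applying the same operation to new data places it in the same principal component space.

```matlab
rng(0)
X = randn(200,6);                       % hypothetical training data
[coeff,score,~,~,~,mu] = pca(X);

% Center with the training mean and project onto the coefficients.
scoreManual = (X - mu)*coeff;

% The manual projection matches the scores returned by pca.
maxDiff = max(abs(scoreManual - score),[],"all");

% Apply the same transformation to new data, keeping three components.
XNew = randn(20,6);                     % hypothetical test data
XNewPCA = (XNew - mu)*coeff(:,1:3);
```

Using the training mean, rather than the mean of the test data, keeps information from the test set out of the transformation.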

Evaluate the trained classifier on the test data. The loss function returns the classification error, so lower values indicate higher accuracy.

classificationError = loss(classificationEnsemble,XTestPCA,YTest);
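As a small self-contained illustration of how to read this value, the following sketch trains a similar ensemble on the fisheriris sample data on the CPU and converts the classification error into an accuracy. The data set, the variable names, and the reduced set of options are assumptions chosen to keep the example short. By default, loss for a classification ensemble returns the classification error.

```matlab
load fisheriris                          % sample data: meas and species
rng(0)                                   % for a reproducible partition
cvp = cvpartition(species,"Holdout",0.10);

% Train a boosted ensemble of shallow trees on the training portion.
mdl = fitcensemble(meas(training(cvp),:),species(training(cvp)), ...
    "Method","AdaBoostM2", ...
    "NumLearningCycles",30, ...
    "Learners",templateTree("MaxNumSplits",20));

% Classification error on the held-out observations.
err = loss(mdl,meas(test(cvp),:),species(test(cvp)));

% Accuracy is the complement of the classification error.
accuracy = 1 - err;
```

When the model is trained on GPU arrays, the returned error is itself a GPU array, and you can use it directly in further GPU computations.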

Transfer to Local Workspace

Transfer data or model properties from the GPU to the local workspace for use with a function that does not support GPU arrays.

Transferring GPU arrays can be costly and is generally not necessary unless you need to use your result with functions that do not support GPU arrays or in another workspace where a GPU is unavailable.

The gather (Parallel Computing Toolbox) function transfers data from the GPU into the local workspace. Gather the dist data and confirm that the data is no longer a GPU array.

dist = gather(dist);
TF = isgpuarray(dist)
TF = logical
   0

The gather function also transfers the properties of a machine learning model from the GPU into the local workspace. Gather the classificationEnsemble model and confirm that model properties that were previously GPU arrays, such as X, are no longer GPU arrays.

classificationEnsemble = gather(classificationEnsemble);
TF = isgpuarray(classificationEnsemble.X)
TF = logical
   0
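Because gather leaves in-memory arrays unchanged, you can write code that runs with or without a GPU and always ends with a result in the local workspace. The following is a minimal sketch of that pattern. The data and variable names are illustrative assumptions.

```matlab
rng(0)
x = randn(5000,20);          % illustrative data

% Use the GPU when one is available; otherwise stay on the CPU.
if canUseGPU
    x = gpuArray(x);
end

s = skewness(x);             % runs wherever x is stored
s = gather(s);               % has no effect if s is already in memory

TF = isgpuarray(s);          % false in both cases
```

Gathering only the small result s, rather than the full input x, keeps the transfer cost low.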
