This example shows how to benchmark solving a linear system by generating GPU code. The MATLAB® code to solve for `x` in `A*x = b` is very simple. Most frequently, we use matrix left division, also known as `mldivide` or the backslash operator (\), to calculate `x` (that is, `x = A\b`).
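As a minimal illustration of the operation being benchmarked, the backslash operator solves a small system directly. This sketch is not part of the example's files, and the matrix values are arbitrary:

```matlab
% Minimal sketch: solve a small linear system with the backslash operator.
A = [4 1; 2 3];           % a well-conditioned 2-by-2 system
b = [9; 13];
x = A\b;                  % matrix left division (mldivide)
residual = norm(A*x - b); % close to machine precision for a good solve
```

For dense square systems, `mldivide` typically performs an LU factorization with partial pivoting followed by triangular solves, which is the workload measured in this benchmark.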

Running this example requires:

- A CUDA®-enabled NVIDIA® GPU with compute capability 3.5 or higher.
- The NVIDIA CUDA toolkit.
- Environment variables for the compilers and libraries. For more information, see Environment Variables.

The following line of code creates a folder in your current working folder (pwd) and copies all the relevant files into it. If you do not want to perform this operation, or if you cannot generate files in this folder, change your current working folder.

```
gpucoderdemo_setup('gpucoderdemo_backslash_bench');
```

Use the `coder.checkGpuInstall` function to verify that the compilers and libraries needed for running this example are set up correctly.

```
envCfg = coder.gpuEnvConfig('host');
envCfg.BasicCodegen = 1;
envCfg.Quiet = 1;
coder.checkGpuInstall(envCfg);
```

It is important to choose an appropriate matrix size for the computations. We do this by specifying the amount of system memory in GB available to the CPU and the GPU. The default value is based only on the amount of memory available on the GPU; you can specify a value that is appropriate for your system.

```
g = gpuDevice;
maxMemory = 0.25*g.AvailableMemory/1024^3;
```

We want to benchmark matrix left division (\), including the cost of transferring data between the CPU and GPU, to get a clear view of the total application time when using GPU Coder™, but not the time to create the data. We therefore separate the data generation from the function that solves the linear system, and generate code for and measure only that operation.

```
type getData.m
```

```
function [A, b] = getData(n, clz)
% Copyright 2017 The MathWorks, Inc.
fprintf('Creating a matrix of size %d-by-%d.\n', n, n);
A = rand(n, n, clz) + 100*eye(n, n, clz);
b = rand(n, 1, clz);
end
```

The backslash function encapsulates the (\) operation for which we want to generate code.

```
type backslash.m
```

```
function [x] = backslash(A,b) %#codegen
% Copyright 2017 The MathWorks, Inc.
coder.gpu.kernelfun();
x = A\b;
end
```

We create a function to generate the GPU MEX function based on the particular input data size.

```
type genGpuCode.m
```

```
function [] = genGpuCode(A, b)
% Copyright 2017 The MathWorks, Inc.
cfg = coder.gpuConfig('mex');
evalc('codegen -config cfg -args {A,b} backslash');
end
```

As with many other parallel algorithms, the performance of solving a linear system in parallel depends greatly on the matrix size. We compare the performance of the algorithm for different matrix sizes.

```
% Declare the matrix sizes to be a multiple of 1024.
sizeLimit = inf;
if ispc
    sizeLimit = double(intmax('int32'));
end
maxSizeSingle = min(floor(sqrt(maxMemory*1024^3/4)),floor(sqrt(sizeLimit/4)));
maxSizeDouble = min(floor(sqrt(maxMemory*1024^3/8)),floor(sqrt(sizeLimit/8)));
step = 1024;
if maxSizeDouble/step >= 10
    step = step*floor(maxSizeDouble/(5*step));
end
sizeSingle = 1024:step:maxSizeSingle;
sizeDouble = 1024:step:maxSizeDouble;
numReps = 5;
```

We use the total elapsed time as our measure of performance because that allows us to compare the performance of the algorithm for different matrix sizes. Given a matrix size, the benchmarking function creates the matrix `A` and the right-hand side `b` once, and then solves `A\b` a few times to get an accurate measure of the time it takes.

```
type benchFcnMat.m
% We need to create a different function for GPU code execution that
% invokes the generated GPU MEX function.
type benchFcnGpu.m
```

```
function time = benchFcnMat(A, b, reps)
% Copyright 2017 The MathWorks, Inc.
time = inf;
% We solve the linear system a few times and take the best run
for itr = 1:reps
    tic;
    matX = backslash(A, b);
    tcurr = toc;
    time = min(tcurr, time);
end
end

function time = benchFcnGpu(A, b, reps)
% Copyright 2017 The MathWorks, Inc.
time = inf;
% Run the MEX function once before timing to exclude one-time
% initialization overhead from the measurement
gpuX = backslash_mex(A, b);
for itr = 1:reps
    tic;
    gpuX = backslash_mex(A, b);
    tcurr = toc;
    time = min(tcurr, time);
end
end
```

Having done all the setup, it is straightforward to execute the benchmarks. However, the computations can take a long time to complete, so we print some intermediate status information as we complete the benchmarking for each matrix size. We also encapsulate the loop over all the matrix sizes in a function, to benchmark both single- and double-precision computations.

Note that actual execution times vary across hardware configurations. This benchmarking was done using MATLAB R2018a on a machine with an 8-core, 2.6 GHz Intel® Xeon® CPU and an NVIDIA Titan X GPU.

```
type executeBenchmarks.m
```

```
function [timeCPU, timeGPU] = executeBenchmarks(clz, sizes, reps)
% Copyright 2017 The MathWorks, Inc.
fprintf(['Starting benchmarks with %d different %s-precision ' ...
         'matrices of sizes\nranging from %d-by-%d to %d-by-%d.\n'], ...
        length(sizes), clz, sizes(1), sizes(1), sizes(end), ...
        sizes(end));
timeGPU = zeros(size(sizes));
timeCPU = zeros(size(sizes));
for i = 1:length(sizes)
    n = sizes(i);
    fprintf('Size : %d\n', n);
    [A, b] = getData(n, clz);
    genGpuCode(A, b);
    timeCPU(i) = benchFcnMat(A, b, reps);
    fprintf('Time on CPU: %f sec\n', timeCPU(i));
    timeGPU(i) = benchFcnGpu(A, b, reps);
    fprintf('Time on GPU: %f sec\n', timeGPU(i));
    fprintf('\n');
end
end
```

We then execute the benchmarks in single and double precision.

```
[cpu, gpu] = executeBenchmarks('single', sizeSingle, numReps);
results.sizeSingle = sizeSingle;
results.timeSingleCPU = cpu;
results.timeSingleGPU = gpu;
[cpu, gpu] = executeBenchmarks('double', sizeDouble, numReps);
results.sizeDouble = sizeDouble;
results.timeDoubleCPU = cpu;
results.timeDoubleGPU = gpu;
```

```
Starting benchmarks with 9 different single-precision matrices of sizes
ranging from 1024-by-1024 to 25600-by-25600.
Size : 1024
Creating a matrix of size 1024-by-1024.
Time on CPU: 0.008953 sec
Time on GPU: 0.012897 sec

Size : 4096
Creating a matrix of size 4096-by-4096.
Time on CPU: 0.241337 sec
Time on GPU: 0.056992 sec

Size : 7168
Creating a matrix of size 7168-by-7168.
Time on CPU: 0.979883 sec
Time on GPU: 0.126366 sec

Size : 10240
Creating a matrix of size 10240-by-10240.
Time on CPU: 2.456934 sec
Time on GPU: 0.251279 sec

Size : 13312
Creating a matrix of size 13312-by-13312.
Time on CPU: 4.969921 sec
Time on GPU: 0.440791 sec

Size : 16384
Creating a matrix of size 16384-by-16384.
Time on CPU: 9.099973 sec
Time on GPU: 0.699288 sec

Size : 19456
Creating a matrix of size 19456-by-19456.
Time on CPU: 15.161504 sec
Time on GPU: 1.043330 sec

Size : 22528
Creating a matrix of size 22528-by-22528.
Time on CPU: 20.989874 sec
Time on GPU: 1.494572 sec

Size : 25600
Creating a matrix of size 25600-by-25600.
Time on CPU: 23.794767 sec
Time on GPU: 2.062284 sec

Starting benchmarks with 7 different double-precision matrices of sizes
ranging from 1024-by-1024 to 19456-by-19456.
Size : 1024
Creating a matrix of size 1024-by-1024.
Time on CPU: 0.013808 sec
Time on GPU: 0.019805 sec

Size : 4096
Creating a matrix of size 4096-by-4096.
Time on CPU: 0.459535 sec
Time on GPU: 0.169425 sec

Size : 7168
Creating a matrix of size 7168-by-7168.
Time on CPU: 1.899516 sec
Time on GPU: 0.726797 sec

Size : 10240
Creating a matrix of size 10240-by-10240.
Time on CPU: 4.896854 sec
Time on GPU: 1.970779 sec

Size : 13312
Creating a matrix of size 13312-by-13312.
Time on CPU: 10.263277 sec
Time on GPU: 4.184999 sec

Size : 16384
Creating a matrix of size 16384-by-16384.
Time on CPU: 18.258044 sec
Time on GPU: 7.642277 sec

Size : 19456
Creating a matrix of size 19456-by-19456.
Time on CPU: 21.981840 sec
Time on GPU: 12.661797 sec
```

We can now plot the results and compare the performance on the CPU and the GPU, for both single and double precision.

First, we look at the performance of the backslash operator in single precision.

```
fig = figure;
ax = axes('parent', fig);
plot(ax, results.sizeSingle, results.timeSingleGPU, '-x', ...
     results.sizeSingle, results.timeSingleCPU, '-o')
grid on;
legend('GPU', 'CPU', 'Location', 'NorthWest');
title(ax, 'Single-precision performance')
ylabel(ax, 'Time (s)');
xlabel(ax, 'Matrix size');
drawnow;
```

Now, we look at the performance of the backslash operator in double precision.

```
fig = figure;
ax = axes('parent', fig);
plot(ax, results.sizeDouble, results.timeDoubleGPU, '-x', ...
     results.sizeDouble, results.timeDoubleCPU, '-o')
legend('GPU', 'CPU', 'Location', 'NorthWest');
grid on;
title(ax, 'Double-precision performance')
ylabel(ax, 'Time (s)');
xlabel(ax, 'Matrix size');
drawnow;
```

Finally, we look at the speedup of the backslash operator when comparing the GPU to the CPU.

```
speedupDouble = results.timeDoubleCPU./results.timeDoubleGPU;
speedupSingle = results.timeSingleCPU./results.timeSingleGPU;
fig = figure;
ax = axes('parent', fig);
plot(ax, results.sizeSingle, speedupSingle, '-v', ...
     results.sizeDouble, speedupDouble, '-*')
grid on;
legend('Single-precision', 'Double-precision', 'Location', 'SouthEast');
title(ax, 'Speedup of computations on GPU compared to CPU');
ylabel(ax, 'Speedup');
xlabel(ax, 'Matrix size');
drawnow;
```

Remove the temporary files and return to the original folder.

```
cleanup
```