cuBLAS Example
This example multiplies two matrices A and B by using the cuBLAS library. The MATLAB® implementation of General Matrix-Matrix Multiplication (GEMM) is:
function [C] = blas_gemm(A,B)
    C = zeros(size(A));
    C = A * B;
end
Generated CUDA Code
When you generate CUDA® code, GPU Coder™ creates function calls to initialize the cuBLAS library, perform matrix-matrix operations, and release hardware resources that the cuBLAS library uses. The following is a snippet of the generated CUDA code.
cublasEnsureInitialization();
blas_gemm_kernel1<<<dim3(2048U, 1U, 1U), dim3(512U, 1U, 1U)>>>(gpu_C);
alpha1 = 1.0;
beta1 = 0.0;
cudaMemcpy((void *)gpu_alpha1, (void *)&alpha1, 8ULL, cudaMemcpyHostToDevice);
cudaMemcpy((void *)gpu_A, (void *)A, 8388608ULL, cudaMemcpyHostToDevice);
cudaMemcpy((void *)gpu_B, (void *)B, 8388608ULL, cudaMemcpyHostToDevice);
cudaMemcpy(gpu_beta1, &beta1, 8ULL, cudaMemcpyHostToDevice);
cublasDgemm(cublasGlobalHandle, CUBLAS_OP_N, CUBLAS_OP_N, 1024, 1024, 1024,
            (double *)gpu_alpha1, (double *)&gpu_A[0], 1024,
            (double *)&gpu_B[0], 1024, (double *)gpu_beta1, (double *)&gpu_C[0], 1024);
cublasEnsureDestruction();
cudaMemcpy((void *)C, (void *)gpu_C, 8388608ULL, cudaMemcpyDeviceToHost);

To initialize the cuBLAS library and create a handle to the cuBLAS library context, the function cublasEnsureInitialization() calls the cublasCreate() cuBLAS API, which allocates hardware resources on the host and device. It also sets the pointer mode to CUBLAS_POINTER_MODE_DEVICE, which is why the generated code copies the scalars alpha1 and beta1 to the device before calling cublasDgemm.
static void cublasEnsureInitialization(void)
{
    if (cublasGlobalHandle == NULL) {
        cublasCreate(&cublasGlobalHandle);
        cublasSetPointerMode(cublasGlobalHandle, CUBLAS_POINTER_MODE_DEVICE);
    }
}

blas_gemm_kernel1 initializes the result matrix C to zero. This kernel is launched with 2048 blocks and 512 threads per block. These block and thread values correspond to the size of C: the 1024-by-1024 result matrix has 1,048,576 elements, and 2048 blocks of 512 threads provide exactly one thread per element.
static __global__ __launch_bounds__(512, 1) void blas_gemm_kernel1(real_T *C)
{
    int32_T threadIdX;
    threadIdX = (int32_T)(blockDim.x * blockIdx.x + threadIdx.x);
    if (!(threadIdX >= 1048576)) {
        C[threadIdX] = 0.0;
    }
}

Calls to cudaMemcpy transfer the matrices A and
B from the host to the device. The function
cublasDgemm is a level-3 Basic Linear Algebra Subprogram (BLAS3) that
performs the matrix-matrix multiplication:
C = αAB + βC
where α and β are scalars, and A, B, and C are matrices stored in column-major format.
The CUBLAS_OP_N arguments control transpose operations on the input matrices; here they specify that neither A nor B is transposed.
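For comparison, the following is a minimal standalone sketch of the same cublasDgemm call written by hand. The 4-by-4 matrix size, the fill values, and the use of the default CUBLAS_POINTER_MODE_HOST (which lets alpha and beta stay in host memory, unlike the device pointer mode the generated code sets) are illustrative assumptions, not part of the generated code.

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const int n = 4;                           /* small n-by-n matrices for illustration */
    const size_t bytes = (size_t)n * n * sizeof(double);
    double A[16], B[16], C[16];
    for (int i = 0; i < n * n; ++i) {          /* A is all ones, B is all twos */
        A[i] = 1.0;
        B[i] = 2.0;
    }

    double *gpu_A, *gpu_B, *gpu_C;
    cudaMalloc((void **)&gpu_A, bytes);
    cudaMalloc((void **)&gpu_B, bytes);
    cudaMalloc((void **)&gpu_C, bytes);
    cudaMemcpy(gpu_A, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(gpu_B, B, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);                     /* same role as cublasEnsureInitialization */

    /* C = alpha*A*B + beta*C; with the default CUBLAS_POINTER_MODE_HOST,
       alpha and beta are read from host memory, so no device copies are needed */
    const double alpha = 1.0;
    const double beta = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, gpu_A, n, gpu_B, n, &beta, gpu_C, n);

    cudaMemcpy(C, gpu_C, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", C[0]);               /* each entry of C is n*1.0*2.0 = 8.0 */

    cublasDestroy(handle);                     /* same role as cublasEnsureDestruction */
    cudaFree(gpu_A);
    cudaFree(gpu_B);
    cudaFree(gpu_C);
    return 0;
}

Because the matrices are stored in column-major format, the leading-dimension arguments (lda, ldb, ldc) equal the number of rows, n; the value 1024 plays the same role in the generated call above.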
The final calls are to cublasEnsureDestruction() and another cudaMemcpy. cublasEnsureDestruction() calls the cublasDestroy() cuBLAS API to release hardware resources that the cuBLAS library uses. cudaMemcpy copies the result matrix C from the device to the host.
static void cublasEnsureDestruction(void)
{
    if (cublasGlobalHandle != NULL) {
        cublasDestroy(cublasGlobalHandle);
        cublasGlobalHandle = NULL;
    }
}

Prepare blas_gemm for Kernel Creation
GPU Coder requires no special pragma to generate calls to libraries. There are two ways to generate CUDA kernels: coder.gpu.kernelfun and coder.gpu.kernel. This example uses the coder.gpu.kernelfun pragma to generate CUDA kernels. The modified blas_gemm function is:
function [C] = blas_gemm(A,B) %#codegen
    C = coder.nullcopy(zeros(size(A)));
    coder.gpu.kernelfun;
    C = A * B;
end
Note
The input data must contain a minimum of 128 elements for GPU Coder to replace math operators and functions with cuBLAS library implementations. The 1024-by-1024 matrices in this example easily exceed this threshold.