GPU Coder codegen slow performance

Question

Alessandro il 15 Ago 2025

0
Link

Link diretto a questa domanda

https://it.mathworks.com/matlabcentral/answers/2179387-gpu-coder-codegen-slow-performance

Commentato: Chao Luo il 21 Ago 2025

I am running the code shown below with MEX CUDA. Following Chao Luo's suggestion, I prepared a MWE, so that anyone who is interested can run this.

The computational bottleneck is the while loop in this function:

function [V,Policy,Params] = fun_VFI_cuda(p_eqm, a_grid, z_grid, pi_z, Params, vfoptions)
coder.gpu.kernelfun;
Params.r = p_eqm(1);
Params.w = p_eqm(2);
verbose   = vfoptions.verbose ;
%% 1 First solve the value function
n_a = length(a_grid) ;
n_z = length(z_grid) ;
if verbose>=1
    disp('Start Value Function Iteration')
end
r       = p_eqm(1);
w       = p_eqm(2);
lambda  = Params.lambda;
delta   = Params.delta;
alpha   = Params.alpha;
upsilon = Params.upsilon;
crra    = Params.crra;
beta    = Params.beta;
%% 1.1 the return matrix
ReturnMatrix = coder.nullcopy(zeros(n_a,n_a,n_z)) ;
cash = coder.nullcopy(zeros(n_a,n_z)) ;
tic
coder.gpu.kernel()
for z_c = 1:n_z
    coder.gpu.constantMemory(z_grid);
    z = z_grid(z_c);
    for a_c=1:n_a
        coder.gpu.constantMemory(a_grid);
        a = a_grid(a_c);
        profit = solve_entre(a,z,w,r,lambda,delta,alpha,upsilon);
        cash(a_c,z_c) = max(w,profit) + (1+r)*a; % cash depends only on (a,z)
    end
end
coder.gpu.kernel()
for z_c = 1:n_z
    for a_c = 1:n_a 
        coder.gpu.constantMemory(cash);
        cash_is = cash(a_c,z_c) ;
        for aprime_c = 1 : n_a  % Now introduce a'
           
           coder.gpu.constantMemory(a_grid);
           cons = cash_is - a_grid(aprime_c);   
           if cons > 0
               ReturnMatrix(aprime_c,a_c,z_c) = (cons^(1-crra))/(1-crra);
           else
               ReturnMatrix(aprime_c,a_c,z_c) = - Inf ;
           end %end if
        end %end aprime
    end %end a
end %end z
time_ret = toc;
%% 1.1 the value and policy function
V0     = coder.nullcopy(zeros(n_a,n_z)) ; % Initial guess V0
V      = coder.nullcopy(zeros(n_a,n_z)) ;
Policy = coder.nullcopy(zeros(n_a,n_z)) ;
err  = vfoptions.tolerance+1;
iter = 1;
pi_z_tranposed = pi_z';
tic
while err > vfoptions.tolerance
    EV = V0*pi_z_tranposed; %EV(a',z)
    
    coder.gpu.kernel()
    for z_c=1:n_z
        for a_c=1:n_a
            tmpmax = - Inf ;
            maxid = 1 ;
            for aprime_c = 1 : n_a
                coder.gpu.constantMemory(ReturnMatrix);
                coder.gpu.constantMemory(EV);
                entireRHS = ReturnMatrix(aprime_c,a_c,z_c) + beta*EV(aprime_c,z_c);
                if tmpmax < entireRHS 
                    tmpmax = entireRHS ;
                    maxid = aprime_c ;
                end
                V(a_c,z_c)      = tmpmax;
                Policy(a_c,z_c) = maxid;
            end
        end
    end
    % -------------------------- Howard ----------------------------------%
    for h_c = 1 : vfoptions.howards
        EVh = V*pi_z_tranposed;
        
        coder.gpu.kernel()
        for z_c = 1:n_z
            for a_c = 1:n_a
                coder.gpu.constantMemory(Policy);
                coder.gpu.constantMemory(ReturnMatrix);
                coder.gpu.constantMemory(EVh);
                aprime_opt = Policy(a_c,z_c) ;
                V(a_c,z_c) = ReturnMatrix(aprime_opt,a_c,z_c) + beta*EVh(aprime_opt,z_c) ;
            end
        end
    end %end howards
    % --------------------------------- ----------------------------------%
    
    % Update
    err = max(abs(V(:)-V0(:)));
    % if verbose == 2
    %     fprintf('iter = %4.0f, err = %f \n',iter,err)
    % end
    V0 = V;
    iter = iter+1;
end %end while
time_vfi = toc;
if verbose >= 1
    fprintf('Time return matrix:       %8.6f \n',time_ret);
    fprintf('Time vfi:                 %8.6f \n',time_vfi);
    fprintf('Time return matrix + vfi: %8.6f \n',time_ret+time_vfi);
end
end %end function fun_VFI_cuda

Please note that this function calls the function solve_entre, which I copy and paste here:

function [profit,kstar,lstar] = solve_entre(a,z,w,r,lambda,delta,alpha,upsilon)
coder.gpu.kernelfun;
% Get k1, kstar, lstar
%aux    = 1-(1-alpha)*(1-upsilon);
k1a    = (1/(max(r+delta,1e-8)))*alpha*(1-upsilon)*z;
k1b    = (1/w)*(1-alpha)*(1-upsilon)*z;
inside = k1a^(1-(1-alpha)*(1-upsilon)) * k1b^((1-alpha)*(1-upsilon));
k1     = (inside)^(1/upsilon);
kstar  = min(k1,lambda*a);
inside_lab = (1/w)*(1-alpha)*(1-upsilon)*z *kstar^(alpha*(1-upsilon));
lstar  = ( inside_lab )^(1/(1-(1-alpha)*(1-upsilon)));
% Evaluate profit if do choose to be entrepreneur
profit = z*((kstar^alpha)*(lstar^(1-alpha)) )^(1-upsilon) -w*lstar -(delta+r)*kstar;
end %end function solve_entre

I compile and run the code with the following script. Note that I want the input argument a_grid to be of variable size.

clear,clc,close all
%% Parameters
Params.crra    = 1.5;   
Params.beta    = 0.904; 
Params.delta   = 0.06; 
Params.alpha   = 0.33; 
Params.upsilon = 0.21; 
Params.psi     = 0.894; 
Params.eta     = 4.15; 
Params.lambda  = inf; 
Params.r       = 0.0476;  
Params.w       = 0.172; 
p_eqm(1)       = Params.r;
p_eqm(2)       = Params.w;
vfoptions           = struct();
vfoptions.verbose   = 1;
vfoptions.lowmemory = 0;
vfoptions.tolerance = 1e-9;
vfoptions.howards   = 80;
%% Grids 
a_min  = 1e-6; 
a_max  = 4000; 
a_scale = 2; 
n_a = 1000;
a_grid  = a_min+(a_max-a_min)*linspace(0,1,n_a)'.^a_scale;
% z_grid is 40*1 vector
% pi_z is 40*40 transition matrix with rows summing up to one.
n_z = 40;
z_grid = linspace(0.5,1.5,n_z)';  % 40x1 vector
pi_z = rand(n_z,n_z);
pi_z = pi_z./sum(pi_z,2);
%% Compile to get C mex
% Define variable size inputs
vs_a_grid = coder.typeof(a_grid,[2001,1],1);
cfg = coder.gpuConfig('mex');
cfg.GenerateReport = true;
codegen -config cfg fun_VFI_cuda -args {zeros(1,2), vs_a_grid, z_grid, pi_z, Params, vfoptions} -o fun_VFI_cuda_mex
%% Call mex function
[V,Policy,Params] = fun_VFI_cuda_mex(p_eqm, a_grid, z_grid, pi_z,Params,vfoptions) ;
% THE END

The problem is that the performance of the function fun_VFI_cuda_mex is only slightly faster than Matlab native code. Of course this can be driven by some harware factors such as the Nvidia GPU graphics card I am using; However, I would like to ask if there is something I can do to improve the code above. For example, everything is with loops since I thought this is easier to translate for the gpu coder, but maybe I am wrong.

Any comment/answer is greatly appreciated!

0 Commenti
Mostra -2 commenti meno recentiNascondi -2 commenti meno recenti

Accedi per commentare.

Accedi per rispondere a questa domanda.

Answer 1

Chao Luo il 18 Ago 2025

1
Link

Link diretto a questa risposta

https://it.mathworks.com/matlabcentral/answers/2179387-gpu-coder-codegen-slow-performance#answer_1569479

Hi Alessandro,

The code you pasted here seems not complete that I cannot try it. But you can try using gpuPerformanceAnalyzer to profile the generated code.

By just looking at the code, I can see the code of line 15-31 can be rewriten using gpucoder.reduce to improve the performance.

Also, if you can give the full reproduction, I can investigate more.

9 Commenti
Mostra 7 commenti meno recentiNascondi 7 commenti meno recenti

Chao Luo il 19 Ago 2025

Modificato: Walter Roberson il 20 Ago 2025

Apri in MATLAB Online

For 25a, you can replace the for-loop computing the max index with gpucoder.internal.max. The performance on my side is 20X faster.

You can profile the generate code like this:

gpuPerformanceAnalyzer("fun_VFI_cuda.m", {p_eqm, a_grid, z_grid, pi_z,Params,vfoptions}, Config=cfg)
gpuPerformanceAnalyzer("fun_VFI_cuda2.m", {p_eqm, a_grid, z_grid, pi_z,Params,vfoptions}, Config=cfg)

Here is the code:

function [V,Policy,Params] = fun_VFI_cuda2(p_eqm, a_grid, z_grid, pi_z, Params, vfoptions)
coder.gpu.kernelfun;
Params.r = p_eqm(1);
Params.w = p_eqm(2);
verbose   = vfoptions.verbose ;
%% 1 First solve the value function
n_a = length(a_grid) ;
n_z = length(z_grid) ;
if verbose>=1
    disp('Start Value Function Iteration')
end
r       = p_eqm(1);
w       = p_eqm(2);
lambda  = Params.lambda;
delta   = Params.delta;
alpha   = Params.alpha;
upsilon = Params.upsilon;
crra    = Params.crra;
beta    = Params.beta;
%% 1.1 the return matrix
ReturnMatrix = coder.nullcopy(zeros(n_a,n_a,n_z)) ;
cash = coder.nullcopy(zeros(n_a,n_z)) ;
tic
coder.gpu.kernel()
for z_c = 1:n_z
    coder.gpu.constantMemory(z_grid);
    z = z_grid(z_c);
    for a_c=1:n_a
        coder.gpu.constantMemory(a_grid);
        a = a_grid(a_c);
        profit = solve_entre(a,z,w,r,lambda,delta,alpha,upsilon);
        cash(a_c,z_c) = max(w,profit) + (1+r)*a; % cash depends only on (a,z)
    end
end
coder.gpu.kernel()
for z_c = 1:n_z
    for a_c = 1:n_a 
        coder.gpu.constantMemory(cash);
        cash_is = cash(a_c,z_c) ;
        for aprime_c = 1 : n_a  % Now introduce a'
            
            coder.gpu.constantMemory(a_grid);
            cons = cash_is - a_grid(aprime_c);   
            if cons > 0
                ReturnMatrix(aprime_c,a_c,z_c) = (cons^(1-crra))/(1-crra);
            else
                ReturnMatrix(aprime_c,a_c,z_c) = - Inf ;
            end %end if
        end %end aprime
    end %end a
end %end z
time_ret = toc;
%% 1.1 the value and policy function
V0     = zeros(n_a,n_z) ; % Initial guess V0
err  = vfoptions.tolerance+1;
V      = coder.nullcopy(zeros(n_a,n_z));
Policy = coder.nullcopy(zeros(n_a,n_z)) ;
iter = 1;
pi_z_tranposed = pi_z';
tic
while err > vfoptions.tolerance
    EV = beta*V0*pi_z_tranposed; %EV(a',z)
    % Compute ReturnMatrix - EV
    entireRHS = coder.nullcopy(ReturnMatrix);
    for z_c = 1:n_z
        for a_c = 1:n_a
            for aprime_c = 1:n_a
                entireRHS(aprime_c, a_c, z_c) = ReturnMatrix(aprime_c, a_c, z_c) + EV(aprime_c,z_c);
            end
        end
    end
    [V3, Policy3] = gpucoder.internal.max(entireRHS, 'dim', 1);
    V = squeeze(V3);
    Policy = squeeze(Policy3);
    
    % coder.gpu.kernel()
    % for z_c=1:n_z
    %     for a_c=1:n_a
    %         tmpmax = - Inf ;
    %         maxid = 1 ;
    %         for aprime_c = 1 : n_a
    %             coder.gpu.constantMemory(ReturnMatrix);
    %             coder.gpu.constantMemory(EV);
    %             entireRHS = ReturnMatrix(aprime_c,a_c,z_c) + beta*EV(aprime_c,z_c);
    %             if tmpmax < entireRHS 
    %                 tmpmax = entireRHS ;
    %                 maxid = aprime_c ;
    %             end
    %             V(a_c,z_c)      = tmpmax;
    %             Policy(a_c,z_c) = maxid;
    %         end
    %     end
    % end
    
    
    % -------------------------- Howard ----------------------------------%
    for h_c = 1 : vfoptions.howards
        EVh = beta*V*pi_z_tranposed;
        
        coder.gpu.kernel()
        for z_c = 1:n_z
            for a_c = 1:n_a
                aprime_opt = Policy(a_c,z_c) ;
                V(a_c,z_c) = ReturnMatrix(aprime_opt,a_c,z_c) + EVh(aprime_opt,z_c) ;
            end
        end
    end %end howards
    % --------------------------------- ----------------------------------%
    
    % Update
    err = max(abs(V(:)-V0(:)));
    V0 = V;
    iter = iter+1;
end %end while
time_vfi = toc;
if verbose >= 1
    fprintf('Number of iteration:      %8.6f \n',iter);
    fprintf('Time return matrix:       %8.6f \n',time_ret);
    fprintf('Time vfi:                 %8.6f \n',time_vfi);
    fprintf('Time return matrix + vfi: %8.6f \n',time_ret+time_vfi);
end
end %end function fun_VFI_cuda

Alessandro il 21 Ago 2025

Modificato: Alessandro il 21 Ago 2025

Apri in MATLAB Online

I have an update. Replacing squeeze with reshape fixes the problem. Thanks @Walter Roberson!

@Chao Luo I checked the running time and it turns out that your code with

entireRHS = coder.nullcopy(ReturnMatrix);
    coder.gpu.kernel()
    for z_c=1:n_z
        for a_c=1:n_a
            for aprime_c = 1 : n_a
                entireRHS(aprime_c, a_c, z_c)=ReturnMatrix(aprime_c,a_c,z_c)+beta*EV(aprime_c,z_c);
            end
        end
    end
    [V3, Policy3] = gpucoder.internal.max(entireRHS, 'dim', 1);

is slightly (20%) faster than my original code with the "max" implemented manually via a for loop:

coder.gpu.kernel()
    for z_c=1:n_z
        for a_c=1:n_a
            max_val = -inf;
            max_ind = -1;
            for aprime_c = 1 : n_a
                entireRHS = ReturnMatrix(aprime_c,a_c,z_c)+beta*EV(aprime_c,z_c);
                if entireRHS>max_val
                    max_val = entireRHS;
                    max_ind = aprime_c;
                end
            end
            V(a_c,z_c) = max_val;
            Policy(a_c,z_c) = max_ind; % Store the optimal policy
        end
    end
    

So the conclusion is that if one has 2025a, it is better to calculate the array and then its max via gpucoder.internal.max instead of doing the reduction with a loop.

@Chao Luo: For 2024b and earlier version, can I add a directive to the reduction loop to tell the compiler that it is a reduction? Thanks!

Chao Luo il 21 Ago 2025

Apri in MATLAB Online

gpucoder.internal.max is not supported for releases earlier than 25a.

You can try a different approach:

maxVal = gpucoder.reduce(entireRHS, @mymax, 'dim', 1)
maxIdx = coder.ones(n_a, n_z) * int32(n_a);
coder.gpu.kernel()
for z_c = 1:n_z
    coder.gpu.kernel()
    for a_c = 1:n_a
        coder.gpu.kernel()
        for aprime_c = 1:n_a
            if entireRHS(aprime_c, a_c, z_c) == maxVal(1, a_c, z_c)
                maxIdx(a_c, z_c) = gpucoder.atomicMin(maxIdx(a_c, z_c), int32(aprime_c));
            end
        end
    end
end
function out = mymax(lhs, rhs)
out = max(lhs, rhs);
end

Accedi per commentare.

GPU Coder codegen slow performance

0 Commenti
Mostra -2 commenti meno recentiNascondi -2 commenti meno recenti

Risposta accettata

9 Commenti
Mostra 7 commenti meno recentiNascondi 7 commenti meno recenti

Più risposte (0)

Vedere anche

Categorie

Tag

Prodotti

Release

Community Treasure Hunt

GPU Coder codegen slow performance

0 Commenti Mostra -2 commenti meno recentiNascondi -2 commenti meno recenti

Risposta accettata

9 Commenti Mostra 7 commenti meno recentiNascondi 7 commenti meno recenti

Più risposte (0)

Vedere anche

Categorie

Tag

Prodotti

Release

Community Treasure Hunt

0 Commenti
Mostra -2 commenti meno recentiNascondi -2 commenti meno recenti

9 Commenti
Mostra 7 commenti meno recentiNascondi 7 commenti meno recenti