GPU Coder codegen slow performance
I am running the code shown below as a CUDA MEX function generated with GPU Coder. Following Chao Luo's suggestion, I prepared a MWE so that anyone who is interested can run it.
The computational bottleneck is the while loop in this function:
function [V,Policy,Params] = fun_VFI_cuda(p_eqm, a_grid, z_grid, pi_z, Params, vfoptions)
coder.gpu.kernelfun;
Params.r = p_eqm(1);
Params.w = p_eqm(2);
verbose = vfoptions.verbose ;
%% 1 First solve the value function
n_a = length(a_grid) ;
n_z = length(z_grid) ;
if verbose >= 1
    disp('Start Value Function Iteration')
end
r = p_eqm(1);
w = p_eqm(2);
lambda = Params.lambda;
delta = Params.delta;
alpha = Params.alpha;
upsilon = Params.upsilon;
crra = Params.crra;
beta = Params.beta;
%% 1.1 The return matrix
ReturnMatrix = coder.nullcopy(zeros(n_a,n_a,n_z));
cash = coder.nullcopy(zeros(n_a,n_z));
tic
coder.gpu.kernel()
for z_c = 1:n_z
    coder.gpu.constantMemory(z_grid);
    z = z_grid(z_c);
    for a_c = 1:n_a
        coder.gpu.constantMemory(a_grid);
        a = a_grid(a_c);
        profit = solve_entre(a,z,w,r,lambda,delta,alpha,upsilon);
        cash(a_c,z_c) = max(w,profit) + (1+r)*a; % cash depends only on (a,z)
    end
end
coder.gpu.kernel()
for z_c = 1:n_z
    for a_c = 1:n_a
        coder.gpu.constantMemory(cash);
        cash_is = cash(a_c,z_c);
        for aprime_c = 1:n_a % now introduce a'
            coder.gpu.constantMemory(a_grid);
            cons = cash_is - a_grid(aprime_c);
            if cons > 0
                ReturnMatrix(aprime_c,a_c,z_c) = (cons^(1-crra))/(1-crra);
            else
                ReturnMatrix(aprime_c,a_c,z_c) = -Inf;
            end %end if
        end %end aprime
    end %end a
end %end z
time_ret = toc;
%% 1.2 The value and policy function
V0 = zeros(n_a,n_z); % initial guess: must be initialized, since it is read on the first iteration
V = coder.nullcopy(zeros(n_a,n_z));
Policy = coder.nullcopy(zeros(n_a,n_z));
err = vfoptions.tolerance+1;
iter = 1;
pi_z_transposed = pi_z';
tic
while err > vfoptions.tolerance
    EV = V0*pi_z_transposed; % EV(a',z)
    coder.gpu.kernel()
    for z_c = 1:n_z
        for a_c = 1:n_a
            tmpmax = -Inf;
            maxid = 1;
            for aprime_c = 1:n_a
                coder.gpu.constantMemory(ReturnMatrix);
                coder.gpu.constantMemory(EV);
                entireRHS = ReturnMatrix(aprime_c,a_c,z_c) + beta*EV(aprime_c,z_c);
                if tmpmax < entireRHS
                    tmpmax = entireRHS;
                    maxid = aprime_c;
                end
            end
            % write once per (a,z), after the argmax loop completes
            V(a_c,z_c) = tmpmax;
            Policy(a_c,z_c) = maxid;
        end
    end
    % -------------------------- Howard ----------------------------------%
    for h_c = 1:vfoptions.howards
        EVh = V*pi_z_transposed;
        coder.gpu.kernel()
        for z_c = 1:n_z
            for a_c = 1:n_a
                coder.gpu.constantMemory(Policy);
                coder.gpu.constantMemory(ReturnMatrix);
                coder.gpu.constantMemory(EVh);
                aprime_opt = Policy(a_c,z_c);
                V(a_c,z_c) = ReturnMatrix(aprime_opt,a_c,z_c) + beta*EVh(aprime_opt,z_c);
            end
        end
    end %end howards
    % --------------------------------- ----------------------------------%
    % Update
    err = max(abs(V(:)-V0(:)));
    % if verbose == 2
    %     fprintf('iter = %4.0f, err = %f \n',iter,err)
    % end
    V0 = V;
    iter = iter+1;
end %end while
time_vfi = toc;
if verbose >= 1
    fprintf('Time return matrix: %8.6f \n',time_ret);
    fprintf('Time vfi: %8.6f \n',time_vfi);
    fprintf('Time return matrix + vfi: %8.6f \n',time_ret+time_vfi);
end
end %end function fun_VFI_cuda
Please note that this function calls the function solve_entre, which I copy and paste here:
function [profit,kstar,lstar] = solve_entre(a,z,w,r,lambda,delta,alpha,upsilon)
coder.gpu.kernelfun;
% Get k1, kstar, lstar
%aux = 1-(1-alpha)*(1-upsilon);
k1a = (1/(max(r+delta,1e-8)))*alpha*(1-upsilon)*z; % pieces of the unconstrained FOCs
k1b = (1/w)*(1-alpha)*(1-upsilon)*z;
inside = k1a^(1-(1-alpha)*(1-upsilon)) * k1b^((1-alpha)*(1-upsilon));
k1 = inside^(1/upsilon);       % unconstrained capital
kstar = min(k1,lambda*a);      % collateral constraint k <= lambda*a
inside_lab = (1/w)*(1-alpha)*(1-upsilon)*z*kstar^(alpha*(1-upsilon));
lstar = inside_lab^(1/(1-(1-alpha)*(1-upsilon))); % optimal labor given kstar
% Evaluate profit if the agent chooses to be an entrepreneur
profit = z*((kstar^alpha)*(lstar^(1-alpha)))^(1-upsilon) - w*lstar - (delta+r)*kstar;
end %end function solve_entre
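For reference, solve_entre implements the closed-form solution of the entrepreneur's static problem; this is my reading of the code above, stated here for clarity:

$$\pi(a,z)=\max_{0\le k\le\lambda a,\; l\ge 0}\; z\left(k^{\alpha}l^{1-\alpha}\right)^{1-\upsilon}-wl-(\delta+r)k,$$

with unconstrained capital, constrained capital, and labor given by

$$k_1=\left[\left(\frac{\alpha(1-\upsilon)z}{r+\delta}\right)^{1-(1-\alpha)(1-\upsilon)}\left(\frac{(1-\alpha)(1-\upsilon)z}{w}\right)^{(1-\alpha)(1-\upsilon)}\right]^{1/\upsilon},\qquad k^{*}=\min\{k_1,\lambda a\},$$

$$l^{*}=\left(\frac{(1-\alpha)(1-\upsilon)\,z\,(k^{*})^{\alpha(1-\upsilon)}}{w}\right)^{\frac{1}{1-(1-\alpha)(1-\upsilon)}}.$$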
I compile and run the code with the following script. Note that I want the input argument a_grid to be of variable size.
clear,clc,close all
%% Parameters
Params.crra = 1.5;
Params.beta = 0.904;
Params.delta = 0.06;
Params.alpha = 0.33;
Params.upsilon = 0.21;
Params.psi = 0.894;
Params.eta = 4.15;
Params.lambda = inf;
Params.r = 0.0476;
Params.w = 0.172;
p_eqm(1) = Params.r;
p_eqm(2) = Params.w;
vfoptions = struct();
vfoptions.verbose = 1;
vfoptions.lowmemory = 0;
vfoptions.tolerance = 1e-9;
vfoptions.howards = 80;
%% Grids
a_min = 1e-6;
a_max = 4000;
a_scale = 2;
n_a = 1000;
a_grid = a_min+(a_max-a_min)*linspace(0,1,n_a)'.^a_scale;
% z_grid is 40*1 vector
% pi_z is 40*40 transition matrix with rows summing up to one.
n_z = 40;
z_grid = linspace(0.5,1.5,n_z)'; % 40x1 vector
pi_z = rand(n_z,n_z);
pi_z = pi_z./sum(pi_z,2);
%% Compile to get C mex
% Define variable size inputs
vs_a_grid = coder.typeof(a_grid,[2001,1],1);
cfg = coder.gpuConfig('mex');
cfg.GenerateReport = true;
codegen -config cfg fun_VFI_cuda -args {zeros(1,2), vs_a_grid, z_grid, pi_z, Params, vfoptions} -o fun_VFI_cuda_mex
%% Call mex function
[V,Policy,Params] = fun_VFI_cuda_mex(p_eqm, a_grid, z_grid, pi_z, Params, vfoptions);
% THE END
The problem is that the generated MEX function fun_VFI_cuda_mex is only slightly faster than native MATLAB code. Of course this may be driven partly by hardware factors, such as the NVIDIA GPU I am using; however, I would like to ask whether there is something I can do to improve the code above. For example, everything is written with explicit loops, since I thought that would be easier for GPU Coder to translate into kernels, but maybe I am wrong.
Any comment/answer is greatly appreciated!
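For concreteness, here is a vectorized sketch of what I have in mind as an alternative: letting MATLAB's built-in max over the a' dimension do the argmax, instead of the hand-written inner loop. This is untested and uses the same variable names as the function above:

while err > vfoptions.tolerance
    EV = V0*pi_z_transposed; % EV(a',z)
    for z_c = 1:n_z
        % implicit expansion: (n_a x n_a) + (n_a x 1)
        RHS = ReturnMatrix(:,:,z_c) + beta*EV(:,z_c);
        [Vmax, argmax] = max(RHS, [], 1); % reduce over a' (dim 1)
        V(:,z_c) = Vmax(:);
        Policy(:,z_c) = argmax(:);
    end
    err = max(abs(V(:)-V0(:)));
    V0 = V;
end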
Accepted Answer
Chao Luo on 18 Aug 2025
Hi Alessandro,
The code you pasted here is not complete, so I cannot try it. But you can try using gpuPerformanceAnalyzer to profile the generated code.
Just by looking at the code, I can see that lines 15-31 could be rewritten using gpucoder.reduce to improve performance.
Also, if you can provide the full reproduction, I can investigate more.
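For example, a minimal profiling call might look like the sketch below (assuming R2023b or later, the NVIDIA Nsight profiling tools installed, and the same inputs as in the compile script above):

cfg = coder.gpuConfig('mex');
gpuPerformanceAnalyzer('fun_VFI_cuda', ...
    {zeros(1,2), vs_a_grid, z_grid, pi_z, Params, vfoptions}, ...
    'Config', cfg, 'NumIterations', 3);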
9 Comments
Chao Luo on 21 Aug 2025
gpucoder.internal.max is not supported in releases earlier than R2025a.
You can try a different approach:
maxVal = gpucoder.reduce(entireRHS, @mymax, 'dim', 1);
maxIdx = ones(n_a, n_z, 'int32') * int32(n_a);
coder.gpu.kernel()
for z_c = 1:n_z
    coder.gpu.kernel()
    for a_c = 1:n_a
        coder.gpu.kernel()
        for aprime_c = 1:n_a
            if entireRHS(aprime_c, a_c, z_c) == maxVal(1, a_c, z_c)
                maxIdx(a_c, z_c) = gpucoder.atomicMin(maxIdx(a_c, z_c), int32(aprime_c));
            end
        end
    end
end
function out = mymax(lhs, rhs)
out = max(lhs, rhs);
end
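Note that entireRHS in this sketch is assumed to be precomputed as a full n_a-by-n_a-by-n_z array (in the original function it is a scalar inside the loop). One way to form it, via implicit expansion, would be:

entireRHS = ReturnMatrix + beta*reshape(EV, n_a, 1, n_z); % (a',a,z)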