Alternatives to accumarray for faster calculations?

32 visualizzazioni (ultimi 30 giorni)
Hello, I have taken someone else's code which had several for-loops in it, and vectorized the code. The code now runs extremelly fast and even faster when using gpuArray(). I'm at a point now where the slowest part of the code is where I used the accumarray function. I'm not sure if there are alternative approaches to vectorizing code without using accum array.
Here is a rough example of the code I vectorized and an exampe of that code, vectorized
nx = 16;
nz = 512;
A = rand(nx,nx,nz);
%% First Method For Loops
R1 = zeros(nx,nx,nz);
B1 = zeros(nx,nz);
N1 = zeros(nx,nz);
for k = 1:nz
for j = 1:nx
for i = 1:nx
r = round((i^2+j^2)^0.5);
if(r <= nx)
R1(i,j,k) = r;
N1(R1(i,j,k),k) = N1(R1(i,j,k),k)+1;
B1(R1(i,j,k),k) = B1(R1(i,j,k),k) + A(i,j,k);
end
end
end
N1(1,k) = 1;
end
B1 = B1./N1;
%% Second Method Vectorized
[I,J] = ndgrid(1:nx,1:nx);
r = (I.^2+J.^2).^0.5;
R2 = round(r);
L21 = (R2 <= nx);
N2 = squeeze(repmat(histcounts(L21.*R2(:,:,1),1:nx+1),[1 1 nz]));
A2 = reshape(A,[1,numel(A)]);
B2 = reshape(repmat(L21.*R2,[1 1 nz]),[1,numel(A)]);
L22 = logical(B2).*repelem(0:nx:(nz-1)*nx,(nx)^2);
B2 = B2+L22;
B2 = reshape(accumarray((B2+logical(~B2)).',A2.*logical(B2)).',[nx nz])./N2;
% Check that Vectorized Code = For Loop Code
sum(sum(sum(B1-B2)))
The vectorized code isn't much faster when using gpuArray() and is only a tiny bit faster than the non-vectorized code without using gpuArray(). It is currently the bottleneck within my code and I'm not sure what other functions might be able to help me out here.
The great thing about the vectorized code is that I can put it into a function and pull several of the variables out of the function in an initialize step. Still though, the accumarray takes a significant amount of time to run compared to all other parts of my code (not shown) that is vectorized. And when accumarray is put into a function, is appears to be slower when I put tic/toc just outside of the function.
Thanks!

Risposta accettata

Andrea Picciau
Andrea Picciau il 20 Apr 2020
Modificato: Andrea Picciau il 20 Apr 2020
Hi Nathan,
Accumulation operations require synchronisation and are never going to squeeze the best performance out of your GPU. This is valid for all parallel computer architectures, by the way.
As a way to help myself understand your code, I rewrote the last line:
function tempData = iOriginalCode(A, B)
% Breakdown of original code
logicalB = logical(B);
indexes = B + ~logicalB;
indexes = indexes.';
values = A.*logicalB;
values = values.';
tempData = accumarray(indexes, values);
end
I can't immediately see how to change this code to improve performance and the rest of your vectorized code is not readable enough. I'm 90% sure the right thing to do is to go back to designing your algorithm to avoid that accumarray... you should take some of these things into consideration:
Try not to think in terms of indexes and values. Instead, try to formulate as many things as possible in terms of mathematical operations. You're already doing this in some parts of your code. For example, you're writing
A(~logicalB)=0
as
A.*logicalB
(see my re-written code). Eventually, you might be able re-write your accumarray as a matrix-vector multiplication, which should help improve performance.
Use the structure in your data to simplify your code. You're using a lot of repmats, reshapes, and a repelem to generate your indexes array. You can most likely find a better way to re-write that for-loop-based code.
I would go back to the original code and think again about the structure of this R
j = 1:nx;
i = 1:nx;
r = (i^2+j^2)^0.5;
R = round(r);
R(R>nx/2) = 0;
Use gputimeit to measure your GPU performance. The right way to measure the performance of the GPU on this is to use gputimeit, and on my K40c (with the data obtained from your script) I got this:
>> gputimeit(@() iOriginalCode(A, B), 1)
ans =
0.1309
The profiler can help you, but it's going to mess with the underlying flow of operations on the GPU. By the way, the profiler agrees that accumarray is the bottleneck here.
  9 Commenti
Andrea Picciau
Andrea Picciau il 25 Apr 2020
I used that cast as a hack to make sure everything's done on the GPU. If you remove that, you'll have to have
allindices = gpuArray(1:nx);
otherwise iSelectRows is going to be executed on the CPU instead of the GPU!
The multi-dimension technique in iSumByPage is really necessary to be able to select the elements of A as an elementwise array product
indices = repmat(indices, [1, 1, 1, size(A,3)]);
A = reshape(A, [size(A,1), size(A,2), 1, size(A,3)]);
A = A.*indices;
which is quite fast on a GPU, instead of using indexing, which I would expect to be slower. Keep in mind that this technique uses more GPU memory: at some point, allocating all that GPU memory is going to be slower than a for loop with indexing. You'll have to evaluate the performance depending on your application...
I enjoyed answering your question very much. If you found the answer useful, please upvote it for the others to see!
Nathan Zechar
Nathan Zechar il 27 Apr 2020
Thank you again Andrea, hope to be as amazing as you someday!

Accedi per commentare.

Più risposte (0)

Prodotti


Release

R2020a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by