Parallel matrix operations on GPU using arrayfun: why is it slower than looping, and/or is there a better method for coding this?

I have been attempting to parallelize a stack of matrix operations performed on the same root matrix on a GPU. A MWE is included below. The basic problem can be summed up using this example:
let im be some matrix of size [M x N], and let [c1, c2, ...] be parameter vectors of length K. The operation is given by the equation
out = sum over kk = 1:K of f(im, c1(kk), c2(kk), ...),
where f() is some function of the input matrix im and the parameters, and the output is the same size as the original input matrix. Since the function is being repeated on the same matrix, just with different parameters, I should be able to parallelize the function evaluations, even if I have to perform the summation as a subsequent step.
I did this using arrayfun, but in my implementation it is slower than running it as a loop. I do not understand why, so I have two primary questions:
  1. Is my implementation bad? Is there a better way to perform this type of operation?
  2. Is there a fundamental reason arrayfun isn't outperforming the loop?
This is a stand-in for a much more complicated problem, but should get me where I need to be. A minimal working example is below. NOTE: the sum over the cells in the arrayfun portion is not the main chokepoint in speed; it's arrayfun itself.
%%MWE of highly parallel operations on GPU
function highParMWE
%%Looped Version
% let's create a 500x500 pixel image
im = gpuArray(rand(500,500));
% let's create two 'corrections' to the image, with 101 different realizations.
corr1 = gpuArray(rand(101,1));
corr2 = gpuArray(rand(101,1));
%
% I want to create an output image the same size as the original, but with
% the following equation applied:
%
% new_im = sum over ii = 1:101 of (im + corr1(ii)) / corr2(ii).
%
% So I am applying each correction pair to the original image, then summing
% all of the resulting images. Note, this is a stand-in for an application
% with much more complex functions than the above equation. (I realize this
% exact example might be done more quickly with 3-D arrays, but that is not
% what I am going for.)
%
% My first instinct is to use a 'for' loop. However, since these are
% relatively small arrays, I was hoping to parallelize the calculation on
% the GPU. I may be misunderstanding the documentation, but arrayfun seems
% like the right way to do this.
%
% BUT, when I time it, the loop is actually faster.
%
t_loop = gputimeit(@() loopit(im,corr1,corr2));
%
t_array = gputimeit(@() arrayfunit(im,corr1,corr2));
%
% My questions:
%
% (1) Is there a better way to do this?
% (2) Why is arrayfun not significantly faster than looping? Often it is
% slower.
end
%
function imfinal = loopit(im,corr1,corr2)
% Straightforward serial accumulation: one correction applied per iteration.
imfinal = gpuArray.zeros(size(im));
for ii = 1:length(corr1)
    imfinal = imfinal + calculateCorrImage(im,corr1(ii),corr2(ii));
end
imfinal = gather(imfinal);
end
%
function imfinal = arrayfunit(im,corr1,corr2)
% I have to set 'UniformOutput' to 0 because I am outputting an image
% matrix that is not the same size as my number of corrections.
jj = 1:length(corr1);
tmp = arrayfun(@(jj) calculateCorrImage_arr(jj,im,corr1,corr2), jj, 'UniformOutput', 0);
% arrayfun has dumped out a cell array which I still have to sum over.
imfinal = gpuArray.zeros(size(im));
for ii = 1:length(corr1)
    imfinal = imfinal + tmp{ii};
end
imfinal = gather(imfinal);
end
%
function out = calculateCorrImage(im,corr1,corr2)
out = (im+corr1)/corr2;
end
%
function out = calculateCorrImage_arr(ii,im,corr1,corr2)
out = (im+corr1(ii))/corr2(ii);
end
Thanks again all!

Accepted Answer

Joss Knight on 22 Sep 2016
arrayfun IS just a loop, unless the inputs are gpuArrays. Your input isn't a gpuArray, because you've passed all your gpuArrays as up-values. What you're doing is just simple algebra with scalar expansion:
corr1 = reshape(corr1, 1, 1, []);
corr2 = reshape(corr2, 1, 1, []);
imfinal = sum( arrayfun(@calculateCorrImage, im, corr1, corr2), 3 );
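For completeness, here is a sketch of how that could slot into the MWE as a replacement for arrayfunit (the function name arrayfunit2 is mine, and note the intermediate array is 500-by-500-by-101, so this trades memory for a single fused kernel launch):
function imfinal = arrayfunit2(im,corr1,corr2)
% Put the corrections along the third dimension so that arrayfun's
% dimension expansion produces one 500-by-500 slice per correction pair.
corr1 = reshape(corr1, 1, 1, []);   % 1-by-1-by-101
corr2 = reshape(corr2, 1, 1, []);   % 1-by-1-by-101
% One compiled GPU kernel evaluates every element of the M-by-N-by-K
% stack; summing over dim 3 collapses it back to a single image.
imfinal = gather(sum(arrayfun(@calculateCorrImage, im, corr1, corr2), 3));
end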
  3 Comments
Joss Knight on 30 Sep 2016 (edited 30 Sep 2016)
Hi Dan, that's fine, I generally dash off a quick response and wait to see if anything I say needs clarification, because often the original poster never sees my response!
In short, arrayfun accepts inputs that are all the same size, and outputs a result of that size. It is an 'element-wise' operator that can only handle 'embarrassingly parallel' problems, meaning it carries out the exact same operation on every element of the array.
A trivial example would be adding two arrays together:
C = arrayfun(@plus, A, B);
But of course that's just the same as
C = A + B;
When you want to do something more complicated, but equally 'element-wise', arrayfun comes into its own.
C = (A.*X + B) ./ K;
translates to
C = arrayfun(@(a,x,b,k) (a*x + b)/k, A, X, B, K);
Why is that better? It may run faster, because arrayfun compiles a kernel that does that sequence of operations, whereas the 'standard' approach may have to launch several kernels.
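As an illustration (my sketch, not part of the original answer; sizes are arbitrary), gputimeit can show the difference between the multi-kernel expression and the single fused kernel:
A = gpuArray.rand(4000);  X = gpuArray.rand(4000);
B = gpuArray.rand(4000);  K = gpuArray.rand(4000);
% Each element-wise operator below may launch its own kernel...
t_standard = gputimeit(@() (A.*X + B) ./ K);
% ...whereas arrayfun compiles the whole expression into one kernel.
t_fused = gputimeit(@() arrayfun(@(a,x,b,k) (a*x + b)/k, A, X, B, K));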
A splendid addition to the gpuArray version of arrayfun is scalar expansion: your inputs need only be the same size in every dimension, or have size 1 in that dimension. So for
C = arrayfun(@times, A, B);
if A is 1-by-10 and B is 4-by-1, C will be 4-by-10. If you think about it, this is just an extension of the rules for expanding scalars when you add or multiply an array by a scalar.
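Concretely (a minimal sketch of that example):
A = gpuArray.rand(1,10);     % 1-by-10
B = gpuArray.rand(4,1);      % 4-by-1
% The singleton dimensions are expanded, much like bsxfun, but the
% whole operation still compiles into a single GPU kernel.
C = arrayfun(@times, A, B);  % C is 4-by-10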
In fact, with R2016b just out, dimension expansion is now automatic for element-wise functions, on the GPU or CPU; so in the above case you can just go
C = A .* B;
and you'll get the same behaviour instead of an error.
Whatever happens, arrayfun creates a kernel which has one thread per element of the output array as defined by the sizes of the inputs. You can't create an output of any other size, and if you try, arrayfun will complain when it tries to compile your function. Trying to do vector operations is another way to get the compiler to error.
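For example (a hypothetical sketch of that failure mode), a function that tries to produce a vector per input element will not compile:
A = gpuArray.rand(10,1);
% Errors: each thread must return exactly one scalar per output, and
% concatenation is a vector operation the arrayfun compiler rejects.
B = arrayfun(@(a) [a, a^2], A);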
It's important to be aware that arrayfun is a very different thing on the GPU than on the CPU. For CPU arrays, arrayfun is just a loop. If you pass it at least one gpuArray, it knows you want to run a kernel on the GPU and will do the above. But what you were doing was passing a simple CPU array as the argument to arrayfun and then accessing the gpuArray variables as up-values. That works, but you don't get a kernel out of it.
Up-values are a consequence of using nested functions, as you did in your original code. A nested function is allowed to access variables defined in the scope of the function it is nested inside. In the MATLAB editor these variables are highlighted in light blue.
If the function you pass to arrayfun is a nested function, it can also use up-value variables, with two provisos: firstly, you can only read from, not write into, up-value variables; secondly, you can only read one element at a time.
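A minimal sketch of that pattern (the names are mine): a nested function reads a gpuArray defined in the enclosing function, one element per thread:
function out = upvalueDemo
lut = gpuArray(rand(101,1));   % up-value: visible inside lookupOne
idx = gpuArray(1:101);         % the actual gpuArray argument to arrayfun
out = arrayfun(@lookupOne, idx);
    function y = lookupOne(k)
        % A one-element read of an up-value is allowed; writing into
        % lut here, or reading lut(1:5), would be an error.
        y = 2*lut(k);
    end
end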
If you want to know more about this stuff, check out my blog post on ParallelForAll.
D. Plotnick on 12 Oct 2016 (edited 12 Oct 2016)
Thanks again, the article was also extremely helpful.
Another quick question more specific to the article (probably heading too far into the weeds here): are any of the basic scalar operations inherently faster than others, e.g. is subtraction faster than division?
The reason I ask is that the path loss in your article is handled in the log (dB) domain, which converts it to a subtraction. Other transmission losses and gains, such as absorption, noise, and directional gain (which you handle as a brick wall in angle), are also calculated in this manner. From a hand-calculation standpoint this makes everything much easier, but is there any particular reason one operation would be numerically preferable (stability, speed, etc.)?


More Answers (0)
