How to speed up the following?

TheStranger on 13 Dec 2021
Edited: Jan on 15 Dec 2021
Hi there!
I have a problem: I am trying to speed up the calculation of a matrix (2D array).
Consider a rank-3 tensor of size N×N×N (a 3-dimensional array, if you wish) called A and a vector of size N called D; I construct the two-dimensional array S in the following way:
A = rand(N, N, N);
D = rand(N, 1);
S = zeros(N);
for i = 1:N
    for j = 1:N
        S = S + A(:, :, i) .* A(:, :, j) / (D(i) + D(j));
    end
end
Afterwards, I solve a linear system involving S.
How can I speed this up?
A naive idea is to get rid of the explicit for loop over "j" and replace it with a vectorized summation. For instance, I can do the following:
A = rand(N, N, N);
D = rand(N, 1);
S = zeros(N);
Dj = repmat(D, 1, N, N);
Dj = permute(Dj, [3, 2, 1]);
for i = 1:N
    S = S + sum(repmat(A(:, :, i), 1, 1, N) .* A ./ (D(i) + Dj), 3);
end
This seems like the right way to go: in the spirit of vectorization, we get rid of the inner for loop.
However, it turns out that for sufficiently large N the first version is faster than the second one (please see the picture below). I understand that this is probably due to repmat and sum(..., 3), but is there any way to speed this up?
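Another idea (a sketch, untested): the summand is symmetric under swapping i and j, so it suffices to visit only the pairs with j >= i and count the off-diagonal contributions twice, which roughly halves the work:

```matlab
N = 200;
A = rand(N, N, N);
D = rand(N, 1);
S = zeros(N);
for i = 1:N
    Ai = A(:, :, i);
    S = S + Ai.^2 / (2 * D(i));                          % diagonal term j == i
    for j = i+1:N
        S = S + 2 * Ai .* A(:, :, j) / (D(i) + D(j));    % each (i,j) pair appears twice in the full sum
    end
end
```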
Thank you in advance!

Accepted Answer

Jan on 13 Dec 2021
Edited: Jan on 15 Dec 2021
N = 200;
A = rand(N, N, N);
D = rand(N, 1);
tic % Original
S = zeros(N);
for i = 1:N
    for j = 1:N
        S = S + A(:, :, i) .* A(:, :, j) / (D(i) + D(j));
    end
end
toc

tic % Version 1
S = zeros(N);
Dj = repmat(D, 1, N, N);
Dj = permute(Dj, [3, 2, 1]);
for i = 1:N
    S = S + sum(repmat(A(:, :, i), 1, 1, N) .* A ./ (D(i) + Dj), 3);
end
toc

tic % Version 2
S = zeros(N);
Dj = repmat(D, 1, N, N);
Dj = permute(Dj, [3, 2, 1]);
for i = 1:N % No need for REPMAT:
    S = S + sum(A(:, :, i) .* A ./ (D(i) + Dj), 3);
end
toc

tic % Version 3
S = zeros(N);
for i = 1:N
    Ai = A(:, :, i);
    Di = D(i);
    for j = 1:N
        S = S + Ai .* A(:, :, j) / (Di + D(j));
    end
end
toc

tic % Version 4
S = zeros(N);
T = zeros(N);
for i = 1:N
    Ai = A(:, :, i);
    Di = D(i);
    T(:) = 0;
    for j = 1:N
        T = T + A(:, :, j) ./ (Di + D(j));
    end
    S = S + Ai .* T;
end
toc

tic % Version 5a
S = zeros(N);
AC = num2cell(A, [1, 2]);
for i = 1:N
    Di = D(i);
    for j = 1:N
        S = S + AC{i} .* AC{j} / (Di + D(j));
    end
end
toc

tic % Version 5b
S = zeros(N);
% Replace AC = num2cell(A, [1,2]) by an inlined loop (tiny advantage):
AC = cell(1, N);
for k = 1:N
    AC{k} = A(:, :, k);
end
for i = 1:N
    Di = D(i);
    Ai = AC{i};
    for j = 1:N
        S = S + Ai .* AC{j} / (Di + D(j));
    end
end
toc
% R2018b, Mobile i5, 16 GB RAM, N=200:
Elapsed time is 6.507656 seconds. % Original
Elapsed time is 6.926377 seconds. % Vectorized
Elapsed time is 4.615969 seconds. % Loops with copy of matrix
Elapsed time is 4.779539 seconds. % Why is this not faster?!
Elapsed time is 2.597977 seconds. % Split once by num2cell
Elapsed time is 2.450893 seconds. % Split with a loop
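A further reformulation is possible (a sketch, assuming R2016b+ implicit expansion and enough memory for an N²-by-N reshape): for each matrix element (p, q), the double sum is the quadratic form b.'*W*b with b = squeeze(A(p, q, :)) and W(i, j) = 1/(D(i) + D(j)), so the whole computation collapses to one matrix product:

```matlab
N = 200;
A = rand(N, N, N);
D = rand(N, 1);
B = reshape(A, [], N);      % row p of B holds A(p1, p2, :) for one matrix element
W = 1 ./ (D + D.');         % W(i,j) = 1/(D(i) + D(j)), via implicit expansion
S = reshape(sum((B * W) .* B, 2), N, N);   % row-wise quadratic forms b.'*W*b
```

Here B*W does the work of both loops inside optimized BLAS, which for large N typically beats the loop variants above, at the cost of the N²-by-N temporaries.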
  7 Comments
Jan on 15 Dec 2021
Maybe Matlab's JIT acceleration can handle
A = rand(10, 10);
A(:, i) .* A(:, j)
without creating the two arrays A(:, i) and A(:, j) explicitly. We do not know this, because the JIT is not documented. MathWorks explains: "If we document the JIT, the users adjust their code to the JIT, but we want to adjust the JIT to the code of the users." Optimizing from both ends would cause confusion.
If A(:, j) is created explicitly, a new Matlab variable is created. This means a header of about 100 bytes containing the class and dimensions, the number of shared copies, and a pointer to the actual data, as well as the actual data itself. This is the reason why A(:, :, j) needs time, while AC{j} is a cheap access to an existing variable.
James Tursa has implemented a smart function that creates shared-data copies by re-using parts of the original data: https://www.mathworks.com/matlabcentral/fileexchange/65842-sharedchild-creates-a-shared-data-copy-of-contiguous-subset . But even here the overhead for creating a Matlab variable is required.
In a C-MEX function you could access the data through pointers, with no overhead for creating temporary variables. If this piece of code is the bottleneck of the complete program, it might be useful to create a MEX version.


