Vectorize and use parfor in a nested loop with five levels

4 visualizzazioni (ultimi 30 giorni)
Hi all,
I'm working on an application that, for now, requires computing a nested loop with five levels. I'm trying to vectorize some parts and to use a parfor for one of the loops. However, I haven't been very successful. Any suggestions would be great. A simplified version of the code looks as follows,
clear all
clc
p1=(1:1:50)';
p2=p1;
np=max(size(p1));
[P1,P2]=meshgrid(p1,p2);
DP=unique(P1-P2);
K=max(size(DP));
ns=1000;
ne=10;
S1=randn(ns,ne);
S2=randn(ns,ne);
S3=randn(ns,ne);
S4=randn(ns,ne);
L1=rand(ns,ne);
L2=rand(ns,ne);
L3=rand(ns,ne);
L4=rand(ns,ne);
a=2;
BR=zeros(np,K,K,K,K);
P=BR;
for i=1:K,
DP1=DP(i);
C1=mean(a*S1*DP1>=0,2);
e1=bsxfun(@times,L1,(1-2*C1));
for j=1:K,
DP2=DP(j);
C2=mean(a*S2*DP2>=0,2);
e2=bsxfun(@times,L2,(1-2*C2));
for l=1:K,
DP3=DP(l);
C3=mean(a*S3*DP3>=0,2);
e3=bsxfun(@times,L3,(1-2*C3));
for m=1:K,
DP4=DP(m);
C4=mean(a*S4*DP4>=0,2);
e4=bsxfun(@times,L4,(1-2*C4));
for pp=1:np,
pr=p2(pp);
dp1=p1-pr;
A=reshape(-bsxfun(@times,a.*C4,dp1'),ns,1,np);
N=bsxfun(@minus,A,e4);
f=exp(N);
PR=f./(1+f);
rev=zeros(ns,ne,np);
for zz=1:np,
rev(:,:,zz)=PR(:,:,zz)*p1(zz);
end
er=reshape(sum(mean(rev,2)),np,1);
[mpr,ind]=max(er);
BR(pp,m,l,j,i)=p1(ind);
P(pp,m,l,j,i)=mpr;
end
end
end
end
end
In this version the problems start when evaluating BR and P in the inner loop. Even though all pp, m, l, j and i are defined, it takes a lot of memory and time to evaluate BR(pp,m,l,j,i)=p1(ind), and I really don't know why. Then, after solving that, I need to improve on vectorizing what might be possible to vectorize, and choose a loop to use the parfor.
Thanks,

Risposta accettata

Jan
Jan il 28 Giu 2013
Modificato: Jan il 30 Giu 2013
Some small improvements:
% Original code:
rev=zeros(ns,ne,np);
for zz=1:np,
rev(:,:,zz)=PR(:,:,zz)*p1(zz);
end
% Vectorized (is this faster?):
rev = bsxfun(@times, PR, reshape(p1, 1, 1, np));
Calculating sum(mean(rev,2)) contains 2 summations. Perhaps one sum is faster:
er = sum(reshape(rev, np, []), 2) / ne;
Avoir some multiplications by -1:
% Original:
dp1 = p1-pr;
A = reshape(-bsxfun(@times,a.*C4,dp1'),ns,1,np);
% New
dp1 = pr - p1;
A = reshape(bsxfun(@times, a.*C4, dp1'), ns, 1, np);
In these lines:
C1 = mean(a*S1*DP1>=0,2);
some simplifications are possible: a * S1 * DP1 >= 0 does not depend on the positivbe constant a. DP1 is a scalar and only its sign is interesting. So the meanS1 = mean(S1>=0, 2) can be calculated once and in case of a negative sign of DP1 1 - meanS1 can be calculated fast. DP1==0 must be caught also.
I'd start a parallelization after these mathematical simplifications have been performed. Avoiding calculations is simply more efficient than trying to do them faster.
[EDITED] Now I've found the time to run your code on my computer. Beside the fact that 4GB are not enough to run your code (how much RAM do you have?), a simple test revealed the bottleneck: Use the profiler to find out, which lines consumes the most time. You will see, that all my ideas are meaningless, because they improve parts of your code, which require only one percent of the total time. Only vectorizing the creation of rev matters significantly. But the rest is dominated by the very expensive calculation of:
f = exp(N);
So this is the point where improvements are required at first. N is created by BSXFUN and you can exploit, that exp(a - b) equals exp(a) / exp(b), and that exp(a * b) equals exp(a) ^ b. Then the innermost loop becomes:
exp_e4_inv = exp(e4);
exp_a_C4 = exp(a * C4);
for pp = 1:np
dp1 = p2(pp) - p1;
Ax = reshape(bsxfun(@power,exp_a_C4, dp1'), ns, 1, np);
f = bsxfun(@times, Ax, exp_e4_inv);
PR = f./(1+f);
rev = bsxfun(@times, PR, reshape(p1, 1, 1, np));
er = reshape(sum(reshape(rev, [], 3)) / ne, np, 1);
[mpr,ind]=max(er); % No changes here...
BR(pp,m,l,j,i)=p1(ind);
P(pp,m,l,j,i)=mpr;
end
This reduces the total number of elements the exponential function is applied to. For the much smaller problem defined by p1=(1:3)', I get 1.45 seconds runtime instead of 3.02 of the original version. The results differ by rounding errors inside the expected limits from the original. So some basic mathematical simplifications gained the double speed.
Well, double speed is not impressing by such a large problem. But I hesitate to start to parallelize code without considering this. Now this is a good point to come back to the original question: How do we PARFOR this?
  5 Commenti
Fernando
Fernando il 1 Lug 2013
Thanks Jan for the edition. I had realized that the costly part was computing the exponential but I had no idea how to improve on that. Once again, thanks! And to answer your question about RAM, I'm running this on a computer cluster so as long as I don't need more than 50gb, it should be OK.

Accedi per commentare.

Più risposte (0)

Categorie

Scopri di più su Loops and Conditional Statements in Help Center e File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by