How can I modify this nested loop to work as a parfor without requiring a ridiculously large array?

6 views (last 30 days)
I am trying to nest several for loops in a parfor loop. I have even read the documentation and several other queries/replies before posting here.
I have a large dataset that I need to iterate over, calculating a property which would produce a prohibitively large array if I sent each answer to a separate element, and doing this on one CPU would take a prohibitively long time.
Instead, what I am doing is reducing the array by another property, and then combining the calculated results in bins of this second property (essentially making a histogram).
The parfor section of my code looks like this:
parfor i = 1:box(1)
    for j = 1:box(2)
        for k = 1:box(3)
            for l = 1:box(1)
                for m = 1:box(2)
                    for n = 1:box(3)
                        prop(horrendousfunction(i,j,k,l,m,n)) = prop(horrendousfunction(i,j,k,l,m,n)) + data(i,j,k)*data(l,m,n);
                    end
                end
            end
        end
    end
end
Trialling this on one CPU over i=j=k=l=m=n=1:15 works fine and takes a few minutes.
The data array is the initial large array, but as written the code iterates over every element of it many times within each parfor step, so the data-transmission I/O overhead shouldn't be too onerous relative to the overall computation time. (The expected complete running time is ~1 month on 1 CPU, and I have several of these to do.)
A few other (small, negligible I/O) arrays are also required.
horrendousfunction is essentially a way of mapping (i,j,k,l,m,n) onto a single index h, which has been split into bins earlier in the code. I will eventually want to plot(h,prop).
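Purely for illustration - my real mapping is different and context-dependent - a function of this general kind might look something like the following (the grid spacing, bin width and abs()-based "property" are all made up here):
% Illustrative only: a toy mapping of (i,j,k,l,m,n) onto a bin index h.
% The grid spacing, bin width and abs()-based property are assumptions.
function h = examplebinning(i, j, k, l, m, n)
    dx = 0.1;                                     % assumed grid spacing
    binwidth = 0.05;                              % assumed bin width
    r = dx*(abs(i-l) + abs(j-m) + abs(k-n));      % toy pair property
    h = floor(r/binwidth) + 1;                    % 1-based bin index
end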
Now, after spending some time trawling the documentation and various previous questions on the topic, I realise that my initial effort contravenes the basic parfor requirement that j, k, l, m, and n (as components of the index of prop) should be constant within the parfor loop. I see that if I wanted to flesh out a behemoth array A(i,j,k,l,m,n) I could do so by building dummy variables and then combining them with an assignment like A(i,:,:,:,:,:) = dummy at the end of each iteration - but this would create the stupidly large array that I wish to avoid (I don't have enough HDD to store it).
Simply calculating trick = horrendousfunction(i,j,k,l,m,n) within the loop is also not an option, because I can't use trick as the index of prop either, since its value changes within the loop too.
Predetermining the complete mapping [i,j,k,l,m,n] -> h is also not an option, since that would require a behemoth array of its own, and said array would need to be I/O'd to all the matlabpool workers. (Calculating it would take about as long as the original problem anyway.) Also, the mapping is context-dependent, so I can't use the same mapping across the related family of problems I wish to solve.
Is there another way I can write the internals of this loop to take advantage of MATLAB's parallelism without requiring a behemoth output array?
  8 Comments
Matt J
Matt J on 15 Oct 2013
Edited: Matt J on 15 Oct 2013
You still haven't said what the typical magnitudes of box(i) will be.
Do you mean that N is typically 100 and therefore prod(box) is typically 4e5? So the box(i) are around 75? That doesn't sound like a challengingly large data set. Less than 2MB if data(i) are single precision floats. You could comfortably broadcast a separate copy of "data" to each lab very quickly and without straining memory.
Daniel
Daniel on 15 Oct 2013
The problem isn't distributing data(). You're right, that's currently about 4e5 elements - although I expect it will change once I implement the cutoff section properly (I haven't described that here; it's something I'll work on once this works successfully, and it might make the second set of three loops smaller or larger). data() is an array of doubles, by the way, but still only a few MB. I can't see it causing that much overhead on its own.
Part of the problem is in the relationships between the elements - there are ~(4e5)^2 = 16e10 element pairs in that set (and potentially fewer or more once I get the final version working as described above).
The other issue is that the box vector elements themselves depend on the output of a third-party code and are not, so far, controllable... They depend heavily on the particular system being studied, and a priori there is no way to tell how large or small they might be. Having said that, the two systems I've looked at so far have them ranging between 60 and 80 - but they are intrinsically related, with corresponding actual vector lengths and angles.


Accepted Answer

Matt J
Matt J on 13 Oct 2013
Edited: Matt J on 13 Oct 2013
I think I might use SPMD for this, instead of parfor, as below.
So the first modification to make is to abandon 6-dimensional indexing i,j,k,l,m,n. Replace subscripts (i,j,k) everywhere at the top level by a linear index u and (l,m,n) by a linear index v, so that your histogram update reduces simply to
prop(horrendousfunction(u,v)) = ...
prop(horrendousfunction(u,v)) + data(u)*data(v);
You can call ind2sub() inside horrendousfunction as needed with marginal overhead, since it's already horrendous.
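A minimal sketch of that conversion, written here as a wrapper for clarity (the two-argument name and the explicit box argument are placeholders of mine, not part of your code):
% Sketch only: recover the subscripts from the two linear indices and
% defer to the existing six-argument mapping.  box is assumed to hold
% the grid dimensions [box(1) box(2) box(3)].
function h = horrendousLinear(u, v, box)
    [i, j, k] = ind2sub(box, u);   % subscripts of the first point
    [l, m, n] = ind2sub(box, v);   % subscripts of the second point
    h = horrendousfunction(i, j, k, l, m, n);
end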
The next step is to write a modified and vectorized version of horrendousfunction(). Namely, you will write
h = horrendousVectorized(U,v)
to take a vector U of linear indices u and a scalar linear index v and return the corresponding vector of h indices, i.e., h(k) = horrendousfunction(U(k),v) for all k. This should be easy if, as you say, the calculation of h just involves abs() and simple arithmetic. All of those operations vectorize nicely for fixed v.
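A rough sketch of such a vectorized version, using a made-up abs()-based binning as a stand-in for your real arithmetic (the box argument and the constants are placeholders):
% Sketch only: vectorized over a whole vector U of linear indices for a
% fixed scalar v.  Swap the placeholder arithmetic for your own.
function h = horrendousVectorized(U, v, box)
    [I, J, K] = ind2sub(box, U);               % subscripts for every element of U
    [l, m, n] = ind2sub(box, v);               % subscripts of the fixed second point
    dx = 0.1;  binwidth = 0.05;                % placeholder constants
    r = dx*(abs(I-l) + abs(J-m) + abs(K-n));   % element-wise, so fully vectorized
    h = floor(r/binwidth) + 1;                 % one bin index per element of U
end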
Then it's all just,
B = prod(box);
p = zeros(N,1);            % running histogram (one copy per worker once inside spmd)
D = distributed(data(:));  % distribute data over the workers as a column vector
U = distributed((1:B).');  % matching distributed vector of linear indices u
for v = 1:B
    datav = data(v);
    spmd
        h = horrendousVectorized(U, v);        % bin index h for each u
        vals = D*datav;                        % data(u)*data(v) for each u
        p = p + accumarray(h, vals, [N,1]);    % accumulate into this worker's histogram
    end
end
spmd
    p = gplus(p, 1);                           % sum the per-worker histograms onto lab 1
end
prop = gather(p{1});                           % collect the final histogram from lab 1
  11 Comments
Daniel
Daniel on 16 Oct 2013
You're right, I'd forgotten which index I was using. I knew there had to be a way of vectorising that command! Just couldn't see it.
Thanks again :)
I'll update the description with the new timing information once the first serious calculation is complete (it's running now).
Daniel
Daniel on 17 Oct 2013
Hmm... running the for loop for a test set took 89.483 seconds over 4 CPUs, but running the single-line vectorised command took 89.486 seconds over the same 4 CPUs. Not much difference there (and it probably comes from the other applications open on the laptop).


More Answers (1)

Edric Ellis
Edric Ellis on 14 Oct 2013
Does it work to do something like this:
prop = zeros(1, N);
parfor ...
    ...
    tmp = zeros(1, N);
    tmp(horrendousfunction(i,j,k,l,m,n)) = tmp(horrendousfunction(i,j,k,l,m,n)) + data(i,j,k)*data(l,m,n);
    prop = prop + tmp;
end
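Filled in with the loop nest from the question (and accumulating into tmp inside the inner loops), the pattern would look roughly like this, with prop acting as a parfor reduction variable:
prop = zeros(1, N);
parfor i = 1:box(1)
    tmp = zeros(1, N);                 % this iteration's partial histogram
    for j = 1:box(2)
        for k = 1:box(3)
            for l = 1:box(1)
                for m = 1:box(2)
                    for n = 1:box(3)
                        h = horrendousfunction(i,j,k,l,m,n);
                        tmp(h) = tmp(h) + data(i,j,k)*data(l,m,n);
                    end
                end
            end
        end
    end
    prop = prop + tmp;                 % reduction across parfor iterations
end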
  3 Comments
Daniel
Daniel on 15 Oct 2013
Edited: Daniel on 15 Oct 2013
I've tried this out, and while it does allow the parfor loop to execute, the resulting task takes longer - at least for a test set of results.
To be fair, the first test set was rather small, only 1:10 in each loop, which takes ~9.8 s on one CPU on my laptop. It took ~41 s across all four. I figured this was due to I/O overhead in loading data(i,j,k). The output perfectly replicated the single-CPU job (so there wasn't any issue with each parfor worker overwriting prop when it finished).
A second test, 1:20 in each loop, runs for ~10 min on one CPU. It ran for ~40 min across all four. I am wondering if there is a timing issue with writing prop out from each worker/iteration?
Daniel
Daniel on 15 Oct 2013
See my comment in reply to Matt above regarding parallelisation of prop = prop + tmp (or accumarray()).

