Loops in parfor are overly slow in a very simple code

I'm struggling to understand why code that I wrote scales so poorly when run in parallel using parpool.
The operation that I want to run in parallel is in the function unitLinear
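function unitLinear(L,N)
% Decoy workload: repeats an element-wise product N*N times.
a = rand(L,L,L);
b = rand(L,L,L);
for i=1:N
    for j = 1:N
        c = a.*b; % result is discarded; only the timing matters
    end
end
end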
which does nothing useful; it is just a decoy to measure execution performance. (If you are curious, it models the first step of Principal Component Analysis of a set of N volumes, each with LxLxL pixels, by computing all pairwise correlations.)
My approach to run this function in parallel is with the following script testUnitLinear:
nTasks = 40; % number of tasks to be executed in parallel
L = 128; % cube sidelength in pixels (inside testing units)
N = 100; % length of interior loop inside testing unit
fprintf('Results for L:%d N:%d \n',L,N);
% test apart
disp('Computing testing unit in single core');
tInitialUnit= clock();
unitLinear(L,N);
timeUnitSingle= etime(clock(),tInitialUnit);
fprintf('Testing unit in single core: %5.2f \n',timeUnitSingle);
tUnitArray = zeros(nTasks,1); % to store the time seen inside the loop
%tUnitArray = distributed(tUnitArray);
t1 = clock();
parfor i=1:nTasks
    tInitialUnit= clock();
    unitLinear(L,N);
    timeUnit= etime(clock(),tInitialUnit);
    tUnitArray(i) = timeUnit;
    fprintf('Testing unit %d time %5.2f \n',i,timeUnit);
end
tTotal= etime(clock(),t1);
fprintf('Total time: %5.2f \n',tTotal);
fprintf('Sum process time: %5.2f \n',sum(tUnitArray));
fprintf('Average process time: %5.2f \n',sum(tUnitArray)/nTasks);
fprintf('Unit in single core: %5.2f \n',timeUnitSingle);
When run outside the parfor, the first execution of unitLinear took about 10 seconds on my system, and inside the parfor loop (using a parpool opened with the local profile), each execution was indeed reporting about 10 seconds. So far so good. As I was working with an open pool of 16 workers (as seen here)
>> a = gcp
a =
Pool with properties:
Connected: true
NumWorkers: 16
Cluster: local
AttachedFiles: {}
AutoAddClientPath: true
IdleTimeout: 30 minutes
SpmdEnabled: false
... I was expecting that the execution of the parfor loop would amount to (10 seconds per task × 40 tasks) / 16 workers = approx. 25 seconds. However, the measured wall-clock time was around 400 seconds, as if no parallelization were taking place at all, even though htop was reporting all requested cores working (I have no other processes running on this machine) and the fans were in fact busy as hell.
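For reference, the expectation above can be worked out explicitly. This is only a back-of-the-envelope sketch: it assumes every task takes the same time and ignores scheduling and communication overhead. The variable names `nWorkers`, `tTask`, `tNaive`, `tWaves` and `tSerial` are illustrative and not part of the original script.

```matlab
% Rough estimates of the parfor wall time under idealized assumptions.
nTasks   = 40;   % tasks submitted to parfor (from the script above)
nWorkers = 16;   % pool size reported by gcp
tTask    = 10;   % approximate seconds per task

tNaive  = nTasks * tTask / nWorkers;        % assumes work divides evenly among workers
tWaves  = ceil(nTasks / nWorkers) * tTask;  % tasks run in whole "waves" of at most 16
tSerial = nTasks * tTask;                   % no parallel speedup at all

fprintf('Naive estimate:  %5.1f s\n', tNaive);
fprintf('Wave estimate:   %5.1f s\n', tWaves);
fprintf('Serial estimate: %5.1f s\n', tSerial);
```

Since 40 tasks do not divide evenly among 16 workers, the wave estimate (30 s) is probably a more realistic lower bound than 25 s. The measured total is close to the serial estimate instead.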
This is something that I didn't expect. I know that a clean 16x speedup would be asking too much, but a speedup of 1x on 16 local cores is too unexpected, as the parfor loop couldn't be simpler. No files, no shared variables... nothing I can think of... or am I missing something too evident? My main problem is that I cannot tell whether this is somehow expected behavior or the symptom of something going terribly wrong in my system...
Any help is welcome!
I paste here the results of executing the code....
>> tryUnitLinear
Results for L:128 N:100
Computing testing unit in single core
Testing unit in single core: 9.83
Testing unit 40 time 9.77
Testing unit 39 time 9.68
Testing unit 38 time 9.47
Testing unit 37 time 9.57
Testing unit 36 time 9.83
Testing unit 35 time 9.67
Testing unit 34 time 10.01
Testing unit 33 time 9.62
Testing unit 32 time 10.54
Testing unit 31 time 10.40
Testing unit 30 time 11.69
Testing unit 29 time 9.54
Testing unit 28 time 9.51
Testing unit 27 time 9.59
Testing unit 26 time 9.80
Testing unit 25 time 10.23
Testing unit 24 time 10.14
Testing unit 23 time 10.17
Testing unit 22 time 9.37
Testing unit 21 time 9.33
Testing unit 20 time 10.05
Testing unit 19 time 11.25
Testing unit 18 time 10.37
Testing unit 17 time 9.73
Testing unit 16 time 9.95
Testing unit 15 time 10.30
Testing unit 14 time 9.41
Testing unit 13 time 10.47
Testing unit 12 time 10.27
Testing unit 11 time 9.52
Testing unit 10 time 9.67
Testing unit 9 time 9.73
Testing unit 8 time 12.17
Testing unit 7 time 9.96
Testing unit 6 time 10.17
Testing unit 5 time 9.81
Testing unit 4 time 9.92
Testing unit 3 time 9.33
Testing unit 2 time 9.35
Testing unit 1 time 9.70
Total time: 469.00
Sum process time: 399.08
Average process time: 9.98
Unit in single core: 9.83
(by the way, I find it rather weird that the parfor visits i in exactly the reverse order of integers -I was expecting a totally random access pattern- but cannot imagine if it has some relationship with the problem I described)
thanks in advance!

Accepted Answer

Jacob Wood on 18 Feb 2020
MATLAB actually multithreads element-wise multiplication, so the "single core" case already uses all available cores and the parfor implementation brings no additional performance. See this link for more information:
Arabarra on 19 Feb 2020
update: I restarted the computers and you were right: htop now shows me that the matrix multiplication is multithreading like hell. Well, not the result I needed (as it means I cannot gain computing time :-(), but at least I know why... thanks!
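One way to check this on a given machine is to time the same element-wise product with the default thread count and again restricted to a single computational thread. The following is a minimal sketch, not a rigorous benchmark: it uses `maxNumCompThreads` to query and temporarily change the thread count on the client, and the reduced repeat count `N = 20` is a hypothetical value chosen only to keep the run short. How much the two timings differ depends on the hardware, since an element-wise product of this size may be limited by memory bandwidth rather than by the number of cores.

```matlab
% Sketch: compare a.*b with the default thread count against one thread.
L = 128;                        % cube side length, as in the question
N = 20;                         % hypothetical, reduced repeat count
a = rand(L,L,L);
b = rand(L,L,L);

nDefault = maxNumCompThreads;   % threads MATLAB currently uses on the client

tic;
for k = 1:N
    c = a.*b;                   % runs with the default thread count
end
tMulti = toc;

previous = maxNumCompThreads(1);  % restrict to one thread, keep the old setting
tic;
for k = 1:N
    c = a.*b;                   % same work, single computational thread
end
tSingle = toc;
maxNumCompThreads(previous);      % restore the original setting

fprintf('Threads: %d, default: %.3f s, single thread: %.3f s\n', ...
        nDefault, tMulti, tSingle);
```

If the single-thread time is clearly larger than the default one, the "single core" baseline in the question was in fact a multithreaded run, which would explain why distributing the tasks over workers gave no further gain.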
