Parfor: worker aborted during execution of the parfor loop

When running my parfor loop on a remote cluster (16 c5.xlarge worker machines with 2 cores each, plus a dedicated m5.xlarge head node with 2 cores), I get the following error:
Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.
> In distcomp/remoteparfor/handleIntervalErrorResult (line 245)
In distcomp/remoteparfor/getCompleteIntervals (line 395)
In parallel_function>distributed_execution (line 746)
In parallel_function (line 578)
In FUN_CLUSTER_FORECASTING (line 54)
In parallel.internal.cluster.executeFunction (line 29)
In parallel.internal.evaluator.evaluateWithNoErrors (line 14)
In parallel.internal.evaluator/MJSStreamingEvaluator/evaluate (line 40)
In dctEvaluateTask>iEvaluateTask/nEvaluateTask (line 354)
In dctEvaluateTask>iEvaluateTask (line 175)
In dctEvaluateTask (line 81)
In distcomp_evaluate_task>iDoTask (line 152)
In distcomp_evaluate_task (line 74)
In distcomp_evaluate_task_mvm (line 39)
Sending a stop signal to all the labs...
Y is of size (226 × 440) when I get the error. The parfor loop runs without any problems for smaller specifications, e.g. when Y is of size (226 × 120).
A simplified version of the parfor loop:
% initialize the output variable
forecast = zeros(T-T_thres+1,h,length(series_to_eval),1);
% irep is an array, e.g. irep = [113,114,..,214]
irep = T_thres:T-h;
parfor (ij = 1:length(irep))
    fun = BCTRVAR(Y(1:irep(ij),:),h,series_to_eval);
    % h is the forecast horizon, e.g. h = [1,2,..,12]
    for ii = 1:h
        forecast(ij,ii,:,:) = fun(:,ii,series_to_eval);
    end
end
I am not sure whether it is related, but I also get the following warning on the variable Y:
'The entire array or structure Y is a broadcast variable. This might result in unnecessary communication overhead.'
Could the overhead be the cause of the issue?
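For a sense of scale, the size of Y itself can be estimated with a quick back-of-the-envelope calculation. The sketch below assumes Y is a full double-precision matrix (8 bytes per element), which the question does not state explicitly; the variable names are illustrative only.

```matlab
% Rough memory footprint of Y, assuming a full (non-sparse) double matrix.
nRows = 226;                 % number of rows of Y, from the question
nCols = 440;                 % number of columns of Y, from the question
bytesPerDouble = 8;          % assumption: Y is stored as double

bytesY = nRows * nCols * bytesPerDouble;   % total bytes occupied by Y
fprintf('Y occupies about %.2f MB\n', bytesY / 2^20);
```

Under that assumption Y is under 1 MB, so broadcasting it to each worker is a small transfer. Memory allocated inside the function called in the loop body is not covered by this estimate.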
  4 Comments
Merlin Scherer on 5 Jan 2022
The problem does not reproduce when I run the batch on a 'local' cluster.
The parfor loop runs without any problems on the cloud cluster when I change the function inside the parfor loop.
To see how much data was transmitted to the workers, I tested the parfor loop on the 'local' cluster with another function instead of BCTRVAR(.):
         BytesSentToWorkers    BytesReceivedFromWorkers
1        2.0043e+07            1.93e+05
2        2.0004e+07            1.8097e+05
3        1.9247e+07            1.6893e+05
Total    5.9294e+07            5.429e+05
Is this a large amount of data being transferred?
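A table like the one above can be produced with ticBytes and tocBytes from Parallel Computing Toolbox. The following is only a minimal sketch: the random matrix and the loop body are placeholders and not the original BCTRVAR computation, and it assumes a parallel pool can be started on the current machine.

```matlab
% Measure data transfer between the client and the pool workers for one parfor loop.
pool = gcp;                  % reuse the current pool, or start one if none is running
Y = rand(226, 440);          % placeholder data with the dimensions from the question
n = 20;                      % placeholder number of iterations
out = zeros(n, 1);

ticBytes(pool);              % start counting bytes sent to and received from workers
parfor k = 1:n
    % Y is indexed with a range, so the whole matrix is a broadcast variable
    out(k) = sum(Y(1:k, :), 'all');
end
bytes = tocBytes(pool);      % one row per worker: [sent to worker, received from worker]

disp(sum(bytes, 1));         % totals over all workers
```

Calling tocBytes(pool) without an output argument prints the per-worker table directly. For reference, the total sent in the table above, 5.9294e+07 bytes, is roughly 59 MB spread over three workers.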
Merlin Scherer on 7 Jan 2022
Edited: Merlin Scherer on 7 Jan 2022
@Edric Ellis, are there ways to get a more detailed error message?
As you suggested in "Problem with parfor loop", I also ran the job on the remote cluster while requesting only one worker. This did not reproduce the problem. Do you have any thoughts or ideas about what this could mean?


Accepted Answer

Merlin Scherer on 9 Jan 2022
Edited: Merlin Scherer on 9 Jan 2022
I solved the problem by changing the worker machine type from c5.xlarge with 4 GB/core to m5.xlarge with 8 GB/core. So I think that the workers must have had insufficient memory.
This answer to the question
"Please suggest me to write the parallel loop in MATLAB without the workers getting aborted during the course of execution?"
by Raymond Norris helped me figure this out.
  3 Comments
Tengyuan Hao on 14 Nov 2022
I increased the memory and the problem was solved. I went from 10 GB of memory per core (slot) to 20 GB of memory per core (slot).
Abhilash on 27 Nov 2023
It would be awesome if you could write out the steps you took to solve the issue. The link is broken, and I went to the thread, but it does not help at all.
Please please, programmers, when you find a solution, just write it out!


Release

R2021b
