How do I troubleshoot the "lost connection to worker X" parallel error?

Question

MathWorks Support Team on 19 Apr 2017


https://it.mathworks.com/matlabcentral/answers/336076-how-do-i-troubleshoot-the-lost-connection-to-worker-x-parallel-error

Edited: MathWorks Support Team on 4 Jul 2025

Accepted Answer: MathWorks Support Team

How do I troubleshoot when I encounter the error:

The client lost connection to worker #. This might be due to network problems, or the interactive communicating job might have errored.

This error is often preceded by a warning:

A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.

Before the full error is reported:

All workers aborted during execution of the parfor loop.
Error in mycode (line 19)
parfor j = 1:n
The client lost connection to worker 3. This might be due to network problems, or the interactive communicating job might have errored.


Answer 1

MathWorks Support Team on 4 Jul 2025


https://it.mathworks.com/matlabcentral/answers/336076-how-do-i-troubleshoot-the-lost-connection-to-worker-x-parallel-error#answer_263567

Edited: MathWorks Support Team on 4 Jul 2025


There are two key causes of this error: a worker crash or a memory issue. The easiest to rule out is a worker crash.

MATLAB Worker Crash

A crashed worker leaves behind a crash dump, just like a regular MATLAB session. On a cluster, this crash dump is on the compute node hosting that worker.

To locate any crash dumps, see the MATLAB Answers article "How do I locate the crash dump files generated by MATLAB?"

On clusters that do not use MATLAB Job Scheduler, crash information can also be found in the job storage location:

>> c = parcluster()
>> c.JobStorageLocation

In that location, look for the Job# folder for the job number which failed and access any "Job#.log" files.
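
The lookup above can be scripted. The following is a minimal sketch, not an official procedure: the job number 12 is a hypothetical placeholder, and it assumes "JobStorageLocation" is returned as a single path (some cluster types return a struct with per-platform paths instead, in which case pick the field for your platform).

```matlab
% Sketch: list the log files stored for one failed job.
c = parcluster();                      % default cluster profile
storageDir = c.JobStorageLocation;     % assumed to be a single path here

jobNumber = 12;                        % hypothetical ID of the failed job
jobFolder = fullfile(storageDir, sprintf('Job%d', jobNumber));

% dir returns an empty struct array if the folder or files do not exist.
logs = dir(fullfile(jobFolder, '*.log'));

if isempty(logs)
    fprintf('No log files found in %s\n', jobFolder);
else
    for k = 1:numel(logs)
        % Print the full path of each log; use type(...) to display its text.
        fprintf('%s\n', fullfile(logs(k).folder, logs(k).name));
    end
end
```

If nothing is listed, check that the job number matches the failed job and that the job has not already been deleted from the storage location.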

Once you have the crash dump, examine it for further information. If the crash occurred in MEX code that you wrote, it is worth running that MEX code locally to check for issues. Otherwise, please contact Technical Support for help understanding and troubleshooting the crash.

Memory issues

If no crash dumps can be found, then memory issues are the likely cause.

On Linux, the OS might terminate a worker for using too much memory, for example because of a spike in memory usage as data is sent back from the workers to the MATLAB client. This can happen when running very close to the minimum system requirements.

We recommend a minimum of 4 GB of RAM per worker (8 GB if the worker is using Simulink), plus additional memory for the MATLAB client if it is also running on that system. Windows is generally less strict with memory usage and will leverage swap, which, though slower, may allow the worker to continue running.

As an initial troubleshooting step on Linux, try reducing the number of workers in the pool.
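
As a rough illustration of sizing the pool from the memory guideline above, the sketch below derives a worker count and opens a smaller pool. The RAM figures are hypothetical, and the 4 GB reserved for the client is an assumption rather than a documented requirement; substitute values for your own machine.

```matlab
% Hypothetical figures for illustration only.
totalRamGB  = 32;   % physical RAM on the machine
clientRamGB = 4;    % assumed headroom reserved for the MATLAB client
perWorkerGB = 4;    % recommended minimum per worker (use 8 with Simulink)

% Largest worker count that stays within the memory budget.
memoryLimit = floor((totalRamGB - clientRamGB) / perWorkerGB);

c = parcluster();                                    % default cluster profile
numWorkers = max(1, min(memoryLimit, c.NumWorkers)); % never exceed the profile limit

delete(gcp('nocreate'));         % close any existing pool first
pool = parpool(c, numWorkers);   % open the smaller pool
```

If the error disappears with fewer workers, memory pressure is a plausible cause, and the worker count can then be raised gradually while monitoring memory usage.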

Network/Communication Issues

Network/communication issues are also a likely cause. When any of the machines are significantly slowed down by resource contention (e.g., memory swapping), this can delay communication signals between workers enough to disrupt the pool.

In addition to node slowdown, network latency or dropped connections can also cause failures. At this point, check the reliability of the network or consider the step below.

Setting "SpmdEnabled" to false

A pool with "SpmdEnabled" set to true cannot continue once communication between workers, or between workers and the client, has been lost. If you are using the local processes scheduler or MATLAB Job Scheduler and only using "parfor" and "parfeval", you can instead set "SpmdEnabled" to false when creating the pool. See the documentation about "SpmdEnabled" for details:

https://www.mathworks.com/help/parallel-computing/parpool.html

A pool with "SpmdEnabled" set to false is unable to complete "spmd" statements.

With this option, the remaining workers continue to complete the parallel work even after one worker has lost its connection.
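
A minimal sketch of opening such a pool is shown below. It assumes the default cluster profile is the local processes scheduler or MATLAB Job Scheduler, and the loop body is only a placeholder workload.

```matlab
c = parcluster();            % default cluster profile
delete(gcp('nocreate'));     % close any existing pool first

% Open a pool that does not support spmd but tolerates losing a worker.
pool = parpool(c, 'SpmdEnabled', false);

n = 100;
results = zeros(1, n);
parfor j = 1:n
    results(j) = j^2;        % placeholder workload
end
```

Any "spmd" block run against this pool will fail, so use this setting only when the code relies solely on "parfor" and "parfeval".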

If you need further help with this error, please contact Technical Support.
