How do I troubleshoot the "lost connection to worker X" parallel error?

How do I troubleshoot when I encounter the error:
The client lost connection to worker #. This might be due to network problems, or the interactive communicating job might have errored.
This error is often preceded by a warning:
A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.
Before the full error is reported:
All workers aborted during execution of the parfor loop.
Error in mycode (line 19)
parfor j = 1:n
The client lost connection to worker 3. This might be due to network problems, or the interactive communicating job might have errored.

Accepted Answer

MathWorks Support Team on 4 Jul 2025
Edited: MathWorks Support Team on 4 Jul 2025
There are two key causes of this error: a worker crash or a memory issue. Network or communication problems can also be responsible. The easiest cause to rule out is a worker crash.

MATLAB Worker Crash

A crashed worker leaves behind a crash dump, just like a normal MATLAB session. On a cluster, this crash dump will be on the compute node hosting that worker.
Determine the location of any crash dumps from:
For clusters that do not use MATLAB Job Scheduler, crash dumps can also be found under the job storage location:
>> c = parcluster()
>> c.JobStorageLocation
In that location, look for the Job# folder matching the number of the job that failed and open any "Job#.log" files.
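The lookup described above can be sketched in a few lines. This is a minimal sketch, not an official procedure: the job number `failedJobId` is a hypothetical value, and the code assumes a file-based job storage location returned as a single path (on mixed-platform clusters `JobStorageLocation` can be a structure instead, which this sketch does not handle).

```matlab
% Sketch: list log files for one failed job under the job storage location.
c = parcluster();                    % default cluster profile
storage = c.JobStorageLocation;      % assumed to be a single folder path

failedJobId = 3;                     % hypothetical job number; use the one that failed

% Search recursively for files named like "Job3.log".
% Note: this pattern also matches Job30.log, Job31.log, and so on.
pattern = fullfile(storage, '**', sprintf('Job%d*.log', failedJobId));
logs = dir(pattern);

% Print the full path of every matching log file.
for k = 1:numel(logs)
    fprintf('%s\n', fullfile(logs(k).folder, logs(k).name));
end
```

If nothing is printed, check that the job number is correct and that the job was submitted from the same cluster profile.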
Once you have the crash dump, examine it for further information. If the crash occurred in MEX code that you wrote, it is worth running that MEX code locally to check for issues. Otherwise, please contact Technical Support for help with understanding and troubleshooting the crash.

Memory issues

If no crash dumps can be found, then memory issues are the likely cause.
On Linux this might be a worker being terminated by the OS for using too much memory, caused by a spike in memory usage as data is sent back from workers to the MATLAB client. This can happen when running very close to the minimum system requirements.
We recommend a minimum of 4 GB of RAM per worker (8 GB if the worker is using Simulink), plus additional memory for the MATLAB client if it is also running on that system. Windows is generally less strict with memory usage and will leverage swap, which, though slower, may allow it to continue running.
As an initial troubleshooting step on Linux, try reducing the number of workers in the pool.
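One way to choose a smaller pool size is to work backwards from the memory recommendation above. The following is a rough sketch only: `totalRamGB` and `clientReserveGB` are hypothetical values that you would replace with figures for your own machine, and the 4 GB per worker figure is the recommended minimum from this answer, not a measured requirement of your code.

```matlab
% Sketch: size the pool from available memory instead of core count.
totalRamGB      = 32;   % hypothetical: total RAM on the machine
clientReserveGB = 4;    % hypothetical: memory kept back for the MATLAB client
perWorkerGB     = 4;    % recommended minimum per worker (8 if using Simulink)

% Number of workers that fit in the remaining memory (at least 1).
nByMemory = max(1, floor((totalRamGB - clientReserveGB) / perWorkerGB));

c = parcluster();                          % default cluster profile
nWorkers = min(c.NumWorkers, nByMemory);   % never exceed the profile limit

delete(gcp('nocreate'));                   % close any existing pool first
pool = parpool(c, nWorkers);
```

If the error disappears with the smaller pool, memory pressure is the likely cause, and the pool size can then be increased step by step.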

Network/Communication Issues

Network/communication issues are also a likely cause. When any of the machines are significantly slowed down by resource contention (e.g., memory swapping), this can delay communication signals between workers enough to disrupt the pool.
In addition to node slowdown, network latency or dropped connections can also cause failures. At this point, try checking the network reliability or consider the step below.
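After a connection loss, it can help to confirm whether the client still has a usable pool before rerunning the work. The sketch below is one possible check, assuming the default cluster profile; it does not diagnose the underlying network problem.

```matlab
% Sketch: check whether the current pool is still connected; reopen if not.
pool = gcp('nocreate');                 % returns empty if no pool is open

needsRestart = isempty(pool) || ~pool.Connected;
if needsRestart
    delete(pool);                       % no-op when pool is empty
    pool = parpool();                   % reopen with the default profile
end

fprintf('Pool has %d workers.\n', pool.NumWorkers);
```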

Setting "SpmdEnabled" to false

A pool with "SpmdEnabled" set to true cannot continue once communication between workers, or between workers and the client, has been lost. If you are using the local processes scheduler or MATLAB Job Scheduler and only use "parfor" and "parfeval", you can instead set "SpmdEnabled" to false when creating the pool. See the documentation about "SpmdEnabled" for details:
A pool with "SpmdEnabled" set to false is unable to complete "spmd" statements.
With this option, the remaining workers will continue to complete the parallel work even after one worker has lost connection.
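As an illustration, a pool without "spmd" support can be opened as follows. This is a minimal sketch assuming the default cluster profile; the loop body is a placeholder for your own "parfor" work.

```matlab
% Sketch: open a pool with SpmdEnabled set to false and run a parfor loop.
delete(gcp('nocreate'));                   % close any existing pool first

c = parcluster();                          % default cluster profile
pool = parpool(c, 'SpmdEnabled', false);   % spmd statements will not run on this pool

n = 100;
results = zeros(1, n);
parfor j = 1:n
    results(j) = j^2;                      % placeholder for the real computation
end
```

The trade-off is that any code relying on "spmd" must be run on a separate pool created with the default setting.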
If you need further help and support with dealing with this error, please contact Technical Support.
