What is the origin of this bus error?

Wouter on 1 Oct 2019
Answered: Raymond Norris on 4 Jul 2020
I had been running some Monte Carlo simulations on a Linux cluster node for over a week using parfor when a crash happened at about 70% completion. The simulation is a time evolution, so the problem does not become progressively harder. I don't understand the crash report. Luckily I saved some intermediate results, but I would like to have an idea of what went wrong before I try again. In principle, all the code in the script had already been executed on the same machine without trouble.
The error is the following:
[Warning: A worker aborted during execution of the parfor loop. The parfor loop
will now run again on the remaining workers.]
[> In parallel_function (line 599)
In seekGdeptransition_forcluster_Nrealdep (line 51)]
--------------------------------------------------------------------------------
Bus error detected at Sat Sep 28 05:55:53 2019 +0200
--------------------------------------------------------------------------------
Configuration:
Crash Decoding : Disabled - No sandbox or build area path
Crash Mode : continue (default)
Default Encoding : UTF-8
Deployed : false
GNU C Library : 2.17 stable
Graphics Driver : Unknown software
Java Version : Java 1.8.0_144-b01 with Oracle Corporation Java HotSpot(TM) 64-Bit Server VM mixed mode
MATLAB Architecture : glnxa64
MATLAB Entitlement ID : 815978
MATLAB Root : /ssoft/spack/external/MATLAB/R2018a
MATLAB Version : 9.4.0.813654 (R2018a)
OpenGL : software
Operating System : "Red Hat Enterprise Linux Server release 7.6 (Maipo)"
Process ID : 18832
Processor ID : x86 Family 6 Model 79 Stepping 1, GenuineIntel
Session Key : db19bbbe-1534-4337-b32d-f6c8548df595
Static TLS mitigation : Disabled: Unable to open display
Window System : No active display
Fault Count: 1
Abnormal termination
Register State (from fault):
RAX = 00002ac3ad3a2c40 RBX = 0000000000000000
RCX = 00002ac37e0e2d12 RDX = 0000000000000000
RSP = 00002ac3d650b878 RBP = 00002ac3d650b8e0
RSI = 0000000000000000 RDI = 00002ac3b2f1ef50
R8 = 00002ac3b2f1ef28 R9 = 0000000000000000
R10 = 00002ac3d650b8a0 R11 = 0000000000000000
R12 = 000000000000006e R13 = 00002ac3b2f1ef00
R14 = 00002ac3b2f1ef50 R15 = 00002ac3b2f1ef28
RIP = 00002ac3ac643fd0 EFL = 0000000000010202
CS = 0033 FS = 0000 GS = 0000
Stack Trace (from fault):
[ 0] 0x00002ac3ac643fd0 /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+02228176
[ 1] 0x00002ac3acd4cad0 /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+09603792
[ 2] 0x00002ac3acd0815e /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+09322846
[ 3] 0x00002ac3acd08726 /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+09324326
[ 4] 0x00002ac3ace96c01 /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+10955777
[ 5] 0x00002ac3ace9843e /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+10961982
[ 6] 0x00002ac3acd4e338 /ssoft/spack/external/MATLAB/R2018a/sys/java/jre/glnxa64/jre/lib/amd64/server/libjvm.so+09610040
[ 7] 0x00002ac37e0dedd5 /lib64/libpthread.so.0+00032213
[ 8] 0x00002ac37c86502d /lib64/libc.so.6+01040429 clone+00000109
[ 9] 0x0000000000000000 <unknown-module>+00000000
** This crash report has been saved to disk as /home/wverstra/matlab_crash_dump.18832-1 **
MATLAB is exiting because of fatal error
/var/spool/slurmd/job2941726/slurm_script: line 13: 18832 Killed matlab -nodisplay -r "seekGdeptransition_forcluster_Nrealdep(10,100);quit"
FINISHED at Sat Sep 28 05:55:54 CEST 2019
slurmstepd: error: Detected 2 oom-kill event(s) in step 2941726.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
Note that line 51 of the file "seekGdeptransition_forcluster_Nrealdep.m" is just
parfor rr=1:Nreal

Accepted Answer

Daniel M on 19 Oct 2019
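It seems like you are running too many processes and ran out of memory. The last line of your log supports this: slurmstepd reports oom-kill events in the job's cgroup. I've had this happen before, and I just needed to limit my parpool to a smaller size.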
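A minimal sketch of capping the pool size before the loop, assuming the default 'local' cluster profile of R2018a. The worker count of 4, the value of Nreal, and the loop body are placeholders for illustration and are not taken from the original script.

```matlab
% Cap the number of workers so their combined memory fits the job's allocation.
% maxWorkers = 4 is an assumed value; tune it to the memory each worker needs.
maxWorkers = 4;

pool = gcp('nocreate');              % returns [] if no pool is open
if isempty(pool)
    pool = parpool('local', maxWorkers);
elseif pool.NumWorkers > maxWorkers
    delete(pool);                    % close the oversized pool before reopening
    pool = parpool('local', maxWorkers);
end

Nreal = 100;                         % placeholder number of realizations
results = zeros(1, Nreal);
parfor rr = 1:Nreal
    results(rr) = rr^2;              % placeholder for one Monte Carlo realization
end
```

Opening the pool explicitly keeps parfor from starting a default-sized pool, which may be larger than the memory allocation can support.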

More Answers (1)

Raymond Norris on 4 Jul 2020
Hi,
When you submit your Slurm job, you can specify the flag
--mem-per-cpu=<size> (megabytes by default; append G for gigabytes)
Try increasing that value. If you need to run on more cores or nodes, try MATLAB Parallel Server, which scales beyond a single node. Contact support@mathworks.com for more information on MATLAB Parallel Server or for help with configuring your Slurm job.
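One way to keep the pool size consistent with the Slurm request is to read the allocation from inside MATLAB. The following is only a sketch. It assumes the job was submitted with both --cpus-per-task and --mem-per-cpu, so that Slurm exports SLURM_CPUS_PER_TASK and SLURM_MEM_PER_CPU (the latter in megabytes, as far as I know). The per-worker memory estimate of 8000 MB is a made-up number that you would replace with a measurement from your own simulation.

```matlab
% Read the Slurm allocation; str2double returns NaN if a variable is unset.
nCpus       = str2double(getenv('SLURM_CPUS_PER_TASK'));
memPerCpuMB = str2double(getenv('SLURM_MEM_PER_CPU'));

% Assumed peak memory per worker in MB (hypothetical; measure your own job).
memPerWorkerMB = 8000;

if isnan(nCpus) || isnan(memPerCpuMB)
    nWorkers = 1;    % conservative fallback when the variables are not set
else
    % Workers that fit in the total memory, never more than the allocated CPUs.
    nWorkers = min(nCpus, floor(nCpus * memPerCpuMB / memPerWorkerMB));
    nWorkers = max(nWorkers, 1);
end

fprintf('Using %d workers\n', nWorkers);
% The result can then be passed to parpool, e.g. parpool('local', nWorkers).
```

The MATLAB client process also uses memory from the same allocation, so leaving some headroom below the computed limit is probably wise.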

Release

R2018a
