PBS: job killed: vmem 53439315968 exceeded limit 27917287424

16 views (last 30 days)
When I run a MATLAB parallel program (using parfor) on a Linux cluster, the job is always killed with "PBS: Job killed: vmem 53439315968 exceeded limit 27917287424", even though the memory the program actually uses is very small. The system allocates about 5 GB of memory to each worker up front, but while the program is running the whole job uses less than 5 GB in total. So when I use 8 workers, about 50 GB of memory is reserved first and the job cannot run, even though the total memory the program needs is under 5 GB. I do not see this problem on Windows, only on the Linux cluster. I hope someone can help. My job script is below.
#!/bin/bash
#PBS -q batch
#PBS -j oe
#PBS -N matlab
#PBS -l nodes=1:ppn=16
#PBS -l walltime=72:00:00
cd $PBS_O_WORKDIR
NPROCS=`cat $PBS_NODEFILE|wc -l`
cat $PBS_NODEFILE > host.txt
module load matlab/R2017b
matlab -nodesktop -nosplash -r transverse_field_vs_H_errobar >out
echo 'done'
Some details about our cluster:
1. Each node of our cluster has two CPUs, and each CPU has eight cores. RAM = 32 GB.
2. We don't use cgroups.
3. I may not have described my problem clearly, so let me restate it and correct my typos.
There are two types of nodes in our cluster: batch nodes and bigmem nodes. A batch node has 16 cores and 32 GB of RAM; a bigmem node has 32 cores and 256 GB of RAM.
The job is killed with "PBS: job killed: vmem 99439315968 exceeded limit 27917287424" when my program runs on a batch node (16 workers). When the same program runs on a bigmem node (32 workers, double the batch node), it runs fine and uses only 27 GB of RAM. In other words, the system does not size the allocation from the main program; it reserves a large amount of RAM for each worker in advance, even though the program uses very little RAM when it runs normally. Here is my program.
%----------------------------------------------------------------------%
J0=-15; J1=1; alpha=0.4; J2=alpha*J1; J3=J2; T=5; D=5; m=4; m_z=4;
%--------------------------------------------------------------------------%
B=100;
num=16;                         % number of parfor iterations
% Preallocate the result arrays, one element per iteration
x_orient=zeros(1,num);
chi_x=zeros(1,num);
h_cor=zeros(1,num);
change_times=zeros(1,num);
errx_chi=zeros(1,num);
errx_orient_x=zeros(1,num);
parpool(16)                     % open a local pool with 16 workers
parfor h_1=1:num
......
end
Starting parallel pool (parpool) using the 'local' profile ... connected to 16 workers.
ans =
Pool with properties:
Connected: true
NumWorkers: 16
Cluster: local
AttachedFiles: {}
IdleTimeout: 30 minute(s) (30 minutes remaining)
SpmdEnabled: true
As soon as parpool connects, the system immediately claims about 100 GB of RAM. If the node does not have 100 GB of RAM, the parfor loop and everything after it will not run; they run only when the node has enough RAM (100 GB). It puzzles me why the system reserves 100 GB of RAM in advance when the program itself only needs 27 GB. In other words, the RAM that gets reserved is determined not by the main program (the parfor...end block) but by the number of workers in the parpool.
That's my problem. Thank you.
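For illustration, a minimal sketch of opening a pool smaller than the core count, following the observation above that the reserved RAM scales with the number of workers; the 'local' profile is assumed and the loop body is a placeholder:
c = parcluster('local');    % local cluster profile on the batch node
c.NumWorkers = 8;           % fewer workers, so a smaller up-front reservation
pool = parpool(c, 8);       % open the smaller pool explicitly
num = 16;
results = zeros(1, num);    % placeholder result array
parfor h_1 = 1:num
    results(h_1) = h_1;     % placeholder body; the real loop body goes here
end
delete(pool);               % release the workers (and their memory) when done
With an 8-worker pool the 16 iterations still all run; they are simply distributed two per worker on average.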
3 Comments
chengchen li on 15 Jan 2021
Thank you for your willingness to help me. The job script, cluster details, and program are the same as those shown in the question above.
Raymond Norris on 18 Jan 2021
Each node may have 32 GB, but the way PBS is set up on your cluster, it only allows the job about 26 GB. With 16 workers, that's roughly 1.7 GB per worker. If your job requires/uses 27 GB, it will fail.
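For reference, the arithmetic behind those figures as a short MATLAB sketch (the byte count is copied from the PBS error message):
vmem_limit = 27917287424;             % job-wide limit reported by PBS, in bytes
limit_GiB  = vmem_limit / 2^30        % = 26 GiB for the whole job
per_worker = vmem_limit / 16 / 1e9    % about 1.7 GB per worker with a 16-worker pool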
You said this works on Windows, I'm assuming with a smaller pool size. What happens when you run this on Linux with the same number of workers you used on Windows? How much data do you see being sent?
We could also try measuring how much data is being sent back and forth (though it won't tell us how much is being used on the workers) with ticBytes/tocBytes.
. . .
parpool(16);
ticBytes(gcp);
parfor h_1=1:num
......
end
tocBytes(gcp);
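If you want the numbers programmatically (for example to log them from the batch job), a small variant of the same idea, assuming tocBytes' one-row-per-worker, two-column return format:
bytes = tocBytes(gcp);      % columns: [bytes sent to worker, bytes received from worker]
fprintf('sent to workers:       %.1f MB\n', sum(bytes(:,1))/1e6);
fprintf('received from workers: %.1f MB\n', sum(bytes(:,2))/1e6);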
If you comment out the parfor loop, do you still see the 100 GB jump? If not, then it must be happening somewhere in the loop. Might need to see more of the parfor loop.


Answers (0)
