PBS: job killed: vmem 53439315968 exceeded limit 27917287424

16 views (last 30 days)
When I run a MATLAB parallel program (using parfor) on a Linux cluster, the job is always killed with "PBS: Job killed: vmem 53439315968 exceeded limit 27917287424", even though the memory the program actually uses is very small. The system allocates about 5 GB of memory to each worker up front, but while the program is running the whole job uses less than 5 GB in total. So when I use 8 workers, about 50 GB of memory is reserved first and the job cannot run, even though the total memory the program needs is under 5 GB. I do not see this problem on Windows, only on the Linux cluster. I hope someone can help. My job script is below.
#!/bin/bash
#PBS -q batch
#PBS -j oe
#PBS -N matlab
#PBS -l nodes=1:ppn=16
#PBS -l walltime=72:00:00
cd $PBS_O_WORKDIR
NPROCS=`cat $PBS_NODEFILE|wc -l`
cat $PBS_NODEFILE > host.txt
module load matlab/R2017b
matlab -nodesktop -nosplash -r transverse_field_vs_H_errobar >out
echo 'done'
Some details about our cluster:
1. Each node of our cluster has two CPUs, and each CPU has eight cores. RAM = 32 GB.
2. We don't use cgroups.
3. I may not have described my problem clearly, so let me restate it and correct my typos.
There are two types of nodes in our cluster: batch nodes and bigmem nodes. A batch node has 16 cores and 32 GB of RAM; a bigmem node has 32 cores and 256 GB of RAM.
The job is killed with "PBS: job killed: vmem 99439315968 exceeded limit 27917287424" when my program runs on a batch node (16 workers). When the same program runs on a bigmem node (32 workers, double the batch node), it runs fine and uses only 27 GB of RAM. In other words, the system does not size the allocation from the main program; it reserves a large amount of RAM for each worker in advance, even though the program uses very little RAM when it runs normally. Here is my program.
%----------------------------------------------------------------------%
J0=-15; J1=1; alpha=0.4; J2=alpha*J1; J3=J2; T=5; D=5; m=4; m_z=4;
%--------------------------------------------------------------------------%
B=100;
num=16;                         % number of parfor iterations
% Preallocate the result arrays, one element per iteration
x_orient=zeros(1,num);
chi_x=zeros(1,num);
h_cor=zeros(1,num);
change_times=zeros(1,num);
errx_chi=zeros(1,num);
errx_orient_x=zeros(1,num);
parpool(16)                     % open a local pool with 16 workers
parfor h_1=1:num
......
end
Starting parallel pool (parpool) using the 'local' profile ... connected to 16 workers.
ans =
Pool with properties:
Connected: true
NumWorkers: 16
Cluster: local
AttachedFiles: {}
IdleTimeout: 30 minute(s) (30 minutes remaining)
SpmdEnabled: true
As soon as parpool connects, the system immediately claims about 100 GB of RAM. If the node does not have 100 GB of RAM, the parfor loop and everything after it will not run; they run only when the node has enough RAM (100 GB). It puzzles me why the system reserves 100 GB of RAM in advance when the program itself only needs 27 GB. In other words, the RAM that gets reserved is determined not by the main program (the parfor...end block) but by the number of workers in the parpool.
That's my problem. Thank you.
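For illustration, a minimal sketch of opening a pool smaller than the core count, following the observation above that the reserved RAM scales with the number of workers; the 'local' profile is assumed and the loop body is a placeholder:
c = parcluster('local');    % local cluster profile on the batch node
c.NumWorkers = 8;           % fewer workers, so a smaller up-front reservation
pool = parpool(c, 8);       % open the smaller pool explicitly
num = 16;
results = zeros(1, num);    % placeholder result array
parfor h_1 = 1:num
    results(h_1) = h_1;     % placeholder body; the real loop body goes here
end
delete(pool);               % release the workers (and their memory) when done
With an 8-worker pool the 16 iterations still all run; they are simply distributed two per worker on average.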
3 Comments
chengchen li on 15 Jan 2021
Thank you for your willingness to help me. The job script, cluster details, and program are the same as those shown in the question above.
Raymond Norris on 18 Jan 2021
Each node may have 32 GB, but the way PBS is set up on your cluster, it only allows the job about 26 GB. With 16 workers, that's roughly 1.7 GB per worker. If your job requires/uses 27 GB, it will fail.
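For reference, the arithmetic behind those figures as a short MATLAB sketch (the byte count is copied from the PBS error message):
vmem_limit = 27917287424;             % job-wide limit reported by PBS, in bytes
limit_GiB  = vmem_limit / 2^30        % = 26 GiB for the whole job
per_worker = vmem_limit / 16 / 1e9    % about 1.7 GB per worker with a 16-worker pool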
You said this works on Windows, I'm assuming with a smaller pool size. What happens when you run this on Linux with the same number of workers you used on Windows? How much data do you see being sent?
We could also try measuring how much data is being sent back and forth (though it won't tell us how much is being used on the workers) with ticBytes/tocBytes.
. . .
parpool(16);
ticBytes(gcp);
parfor h_1=1:num
......
end
tocBytes(gcp);
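If you want the numbers programmatically (for example to log them from the batch job), a small variant of the same idea, assuming tocBytes' one-row-per-worker, two-column return format:
bytes = tocBytes(gcp);      % columns: [bytes sent to worker, bytes received from worker]
fprintf('sent to workers:       %.1f MB\n', sum(bytes(:,1))/1e6);
fprintf('received from workers: %.1f MB\n', sum(bytes(:,2))/1e6);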
If you comment out the parfor loop, do you still see the 100 GB jump? If not, then it must be happening somewhere in the loop. Might need to see more of the parfor loop.


Answers (0)
