HPC MATLAB parpool and speed

12 views (last 30 days)
RUAN YY
RUAN YY on 25 Sep 2020
Answered: RUAN YY on 25 Sep 2020
Hey guys! I am new to the HPCC, and I am now running my MATLAB program on it. I am using parallel computing, i.e. parpool.
Here is the code for my "submit.sh":
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
/opt/hpc/MATLAB/R2019b/bin/matlab -nojvm -nodesktop -r "main_MultiEA;exit;"
The first thing is that I found the speed to be similar to my local computer. Should I specify something in the .sh file to change this? And how can I know whether I have reached the resource limit or not?
The second thing is that I found that the only available profile is "local", using the "allNames = parallel.clusterProfiles()" command. Should it be different on the HPCC?
The third thing is that when I use "parpool(16)", "parpool('local',16)", "parpool("myPool",16)", etc., to try to improve the speed, the program seems to crash. Here is my test.m for testing the parpool. (I guess the program crashes because no a.mat appears in the directory afterwards.)
parpool("local",16);
a=0;
parfor i = 1:10
a = a+1;
end
save a.mat;
exit;
Could you tell me why that is? And how can I improve the speed? Thanks a lot!

Accepted Answer

Raymond Norris
Raymond Norris on 25 Sep 2020
Hi Ruan,
There are two ways to speed up your code: implicitly and explicitly. You don't have much control over implicit speed-ups; MATLAB will find the best ways to use your multiple cores on its own. Explicitly, you can vectorize, pre-allocate, write MEX-files, etc. You can also use parallel pools.
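As a toy illustration of the explicit speed-ups (a minimal sketch; timings will vary by machine):
n = 1e6;
x = linspace(0, 2*pi, n);
% Growing an array inside a loop forces repeated reallocation (slow)
tic
y = [];
for k = 1:n
    y(k) = sin(x(k))^2;
end
toc
% Pre-allocating avoids the reallocation (faster)
tic
y = zeros(1, n);
for k = 1:n
    y(k) = sin(x(k))^2;
end
toc
% Vectorizing removes the loop entirely (usually fastest)
tic
y = sin(x).^2;
toc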
Looking at your Slurm job script, make the following change:
/opt/hpc/MATLAB/R2019b/bin/matlab -nojvm -nodesktop -r "main_MultiEA;exit;"
to
/opt/hpc/MATLAB/R2019b/bin/matlab -batch main_MultiEA
-batch takes the place of -nodesktop, -r, and the trailing "exit". Note that -nojvm is dropped as well: you'll need the JVM if you use Parallel Computing Toolbox (PCT).
I'd also consider using module if you have it (your module name, matlab, might be slightly different):
module load matlab
matlab -batch main_MultiEA
Next, you're requesting 2 nodes from Slurm, with 2 tasks per node (4 cores total). But MATLAB only runs on a single node, so the 2nd node is of no use. That means when you start the pool of 16 workers, you're running it on 2 cores (or you should be; it may depend on whether cgroups are enforced). This is probably why MATLAB is crashing: you're running out of memory. To write this more flexibly, try
sz = str2double(getenv('SLURM_CPUS_PER_TASK'));  % getenv returns text; convert it to a number
parpool("local",sz);
a=0;
parfor i = 1:10
a = a+1;
end
save a.mat
This way, regardless of the cores per node you request, you'll get the right pool size.
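For instance, a matching submit.sh might look like this (a sketch; the module name and core count are assumptions for your site):
#!/bin/bash
#SBATCH --nodes=1            # MATLAB runs on a single node
#SBATCH --ntasks=1           # one MATLAB client process
#SBATCH --cpus-per-task=16   # sets SLURM_CPUS_PER_TASK, read by the script above
module load matlab
matlab -batch test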
With that said, there are two things to think about:
  1. Obviously, you'll see no speed-up in your example; there has to be a reasonable amount of work to do.
  2. Using the "local" profile, the parallel pool only runs locally, on the single HPC compute node where MATLAB is running. If you want to run a larger pool across nodes, you'll need to create a Slurm profile with MATLAB Parallel Server, as sketched below.
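Once such a profile exists, opening a multi-node pool looks like this (a hypothetical sketch; 'mySlurmProfile' is a placeholder name):
% 'mySlurmProfile' is a placeholder for a Slurm cluster profile
% created with MATLAB Parallel Server
c = parcluster('mySlurmProfile');
pool = parpool(c, 32);   % workers can now span multiple nodes
% ... parfor work here ...
delete(pool);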
Raymond
  3 Comments
Raymond Norris
Raymond Norris on 25 Sep 2020
test.m calls save at the end, so when you run test, either via the CLI or Slurm, you're going to generate a.mat. Do you not want the MAT-file to be generated? If not, simply comment out the line at the bottom of the file.
If this doesn't work
sz = str2double(getenv('SLURM_CPUS_PER_TASK'));
then you might try
sz = str2double(getenv('SLURM_JOB_CPUS_PER_NODE'));
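Combining the two with a fallback might look like this (a sketch; feature('numcores') is a semi-documented query for the physical core count):
sz = str2double(getenv('SLURM_CPUS_PER_TASK'));
if isnan(sz)
    % on a single node, SLURM_JOB_CPUS_PER_NODE is a plain number like '16'
    sz = str2double(getenv('SLURM_JOB_CPUS_PER_NODE'));
end
if isnan(sz)
    sz = feature('numcores');   % last resort: cores MATLAB detects
end
parpool("local", sz);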
What Slurm output/error file is being generated? If your Slurm job script only specifies the name of the job (Group3), it's possible that:
  • You're not requesting enough cores (16 workers, or 17 counting the MATLAB client). Add #SBATCH -n 16
  • You're not requesting enough memory. Add #SBATCH --mem-per-cpu=2048
For instance:
#SBATCH -J Group3
#SBATCH -n 16 # Request 16 cores
#SBATCH --mem-per-cpu=2048 # Request 2 GB/core
/opt/hpc/MATLAB/R2019b/bin/matlab -batch test
Otherwise, please paste in the crash log.
RUAN YY
RUAN YY on 25 Sep 2020
Thank you very much! Let me try!
I want to use the a.mat file to check whether the program crashes or not. That's why I added that "dummy" save statement.


More Answers (1)

RUAN YY
RUAN YY on 25 Sep 2020
I know now why there is no .mat file output:
[Warning: Objects of class 'parallel.cluster.Local' cannot be saved to MAT-files.]
I should check the slurm-JobID.out file, e.g. slurm-21127.out.
Any print, warning, or other output that would normally appear on the command line in the desktop GUI is stored in that slurm-*.out file.
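One way to avoid that warning is to save only the variables you need, rather than the whole workspace (a small sketch):
save('a.mat', 'a');   % saves only 'a', skipping unsaveable pool/cluster objects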
