Distributed job validation passes but parallel job validation fails for Parallel Computing Toolbox.

Hi,
I am trying to use the MATLAB Parallel Computing Toolbox on a cluster. When I validate my scheduler configuration, the distributed job passes validation but the parallel job fails with the following error:
Stage: Parallel Job
Status: Failed
Description: The given stage reached the default or user-specified timeout.
Command Line Output:
2346069.pbs001.palmetto.clemson.edu
Additionally, I find the following error in the log file on the cluster:
Node file: /var/spool/torque/aux//2346072.pbs001.palmetto.clemson.edu
Starting SMPD on node0218 node0219 node0275 node0276 ...
ssh node0218 "/opt/matlab-R2010a/bin/mw_smpd" -s -phrase MATLAB -port 26072
Warning: Permanently added 'node0218,10.125.1.218' (RSA) to the list of known hosts.
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,gssapi-with-mic,password).
Launching smpd failed for node: node0218
Stopping SMPD on ...
Exiting with code: 0
The settings which I have used for the scheduler are:
set(sched, 'ClusterMatlabRoot', '/opt/matlab-new');
set(sched, 'HasSharedFilesystem', true);
set(sched, 'ClusterOsType', 'unix');
set(sched, 'SubmitFcn',{@pbsNonSharedSimpleSubmitFcn,clusterHost, remoteDataLocation});
set(sched, 'ParallelSubmitFcn',{@pbsNonSharedParallelSubmitFcn, clusterHost, remoteDataLocation});
I have also set up a passwordless SSH connection using an RSA key. Could anyone tell me what is wrong with my configuration?
Thanks in advance.

Accepted Answer

Winston Yu on 14 Mar 2011
John: Two levels of passwordless SSH need to be set up. The first is from your desktop to the cluster submission node; this allows you to run MDCS jobs. The second is between the nodes inside the cluster; this allows parallel jobs, which need MPI, to run. The setup is very similar to what you did for the desktop-to-submission-node connection. Basically, run ssh-keygen -t rsa on one of the cluster nodes to create an RSA key if you have not done so already, then append the public key you created to the $HOME/.ssh/authorized_keys file. This allows all the nodes to log into each other, provided your $HOME is on a common shared directory that every node can access.
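The node-to-node step can be sketched as a small shell function to run once on any cluster node. This is only an illustration, and it makes several assumptions: $HOME is shared across the nodes, the key lives at the default id_rsa path, and an empty passphrase is acceptable on your cluster. The function name setup_node_ssh and its optional directory argument are hypothetical, added only so the steps can be tried on a scratch directory first.

```shell
# Sketch: authorize your own RSA key so that nodes sharing $HOME accept each other.
# setup_node_ssh is a hypothetical helper name; the optional argument
# (default: $HOME/.ssh) is an assumption added for dry runs.
setup_node_ssh() {
    ssh_dir="${1:-$HOME/.ssh}"
    key="$ssh_dir/id_rsa"

    mkdir -p "$ssh_dir" || return 1
    chmod 700 "$ssh_dir" || return 1

    # Generate a key pair with an empty passphrase only if none exists yet.
    if [ ! -f "$key" ]; then
        ssh-keygen -q -t rsa -N "" -f "$key" || return 1
    fi

    # Append the public key unless that exact line is already authorized.
    touch "$ssh_dir/authorized_keys" || return 1
    if ! grep -qxF "$(cat "$key.pub")" "$ssh_dir/authorized_keys"; then
        cat "$key.pub" >> "$ssh_dir/authorized_keys" || return 1
    fi

    # With StrictModes (the sshd default), loosely permissioned key files are rejected.
    chmod 600 "$ssh_dir/authorized_keys"
}
```

After calling setup_node_ssh with no argument on a cluster node, something like ssh -o BatchMode=yes node0218 hostname (node name taken from the log above) should print the host name without asking for a password. If it still reports "Permission denied", the cluster may not share $HOME between nodes or may restrict key-based logins, which is a question for the cluster administrators.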

