Test in a suspension-like state with MDCS
Hi,
We are currently developing and testing a program in MATLAB to model multiphysics problems in solid mechanics using the Finite Element Method (FEM). Depending on the complexity of the problem, parallel approaches such as MPI and OpenMP become a good choice for obtaining results in a short time, so our tool was designed to handle large data transfers between processes efficiently. However, nested FEM techniques such as “FE^2” and other hierarchical approaches involve intensive data transfer between cores.
In those cases we looked for the best way to implement an MPI-style scheme in MATLAB: from the master node we distribute several tasks to the slave nodes (at that level each task is computed serially); once every task has been fully computed on its slave node, the resulting data is transferred from the slaves back to the master node, the master updates the process and performs some computations, and the same number of (updated) tasks is sent again from the master to the slave cores. This cycle is repeated several times, depending on the problem.
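Schematically, the pattern looks like the sketch below (illustrative sizes and names; solveTask and updateOnMaster are placeholders for our element routine and the master-side update, not the actual functions of our code):
nSteps = 10;                        % time steps (illustrative size)
nTasks = 100;                       % one task per finite element (illustrative size)
data   = rand(8, nTasks);           % placeholder task data
result = zeros(size(data));
for iStep = 1:nSteps                % outer loop driven by the master
    parfor iTask = 1:nTasks         % tasks distributed to the slave cores
        result(:,iTask) = solveTask(data(:,iTask));   % each task computed serially on its worker
    end
    data = updateOnMaster(result);  % master gathers the results and updates the tasks
end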
From the computational point of view, every FEM test is composed of meshes, every mesh is composed of elements, and each element has “integration points”. The total set of elements is split into several parts, each of which is sent to one slave core; these problems are highly scalable under this scheme.
In particular, I am now trying to run a test with ~2500 elements, each element equipped with 4 integration points; at each integration point the code also solves a BVP using FEM. We are using 220 cores for the ~2500 elements (~11 finite elements per core), and the process is repeated over 600 time steps, which takes more than 3 days with our cluster configuration. However, the test collapses (for certain unknown reasons) after 4-12 hours: it appears to be in a suspension-like state, since it neither ends nor displays errors. This is the portion of code where the test collapses:
parfor iElem = 1:nElem
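    % Per-element copy of the global-data struct, with the indices of this element set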
    e_VG_Aux21 = e_VG;
    e_VG_Aux21.iElemSet = iElem;
    e_VG_Aux21.iElemNum = m_NumElem(iElem);
    condBif = m_VarAuxElem(p_condBif,iElem);
    m_phi_grad = m_VarAuxElem(p_phi_grad,iElem);
    m_phi_grad = reshape(m_phi_grad,4,2);
    m_n_tens = m_VarAuxElem(p_n_tens,iElem);
    m_n_tens = reshape(m_n_tens,4,2);
    m_fii = m_VarAuxElem(p_fii,iElem);
    m_injFactor_old = m_VarAuxElem(p_injFactor,iElem);
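    % Element-level computation: returns the element matrices, residuals and updated state for this element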
    [m_Ke(:,:,iElem),m_Kbu_SDA(:,:,iElem),m_Kbu_MSL(:,:,iElem),m_KbbIN_SDA(:,:,iElem),...
     m_KbbIN_MSL(:,:,iElem),m_Fint(:,iElem),m_Res_beta_SDA(:,iElem),...
     m_Res_beta_MSL(:,iElem),sigma_new(:,iElem),hvar_new(:,iElem),...
     eps_new(:,iElem),m_TensorTang(:,:,:,iElem),m_indSTmacro(:,iElem),...
     m_indActSTmacro(:,iElem),elem_gamma(iElem),kSD(iElem),...
     m_vectVHElem(:,iElem),m_stressTilde_new(:,iElem),m_dissipation_new(:,iElem),aux_var(:,iElem)] = ...
        f_MatElem_quad_q1_SDA ...
        (duElemSet(:,iElem),eps_old(:,iElem),Dbeta_SDA(:,iElem),Dbeta_MSL(:,iElem),...
         aux_var(:,iElem),condBif,leq_elem(iElem),m_phi_grad,m_n_tens,m_fii,...
         sigma_old(:,iElem),hvar_old(:,iElem),e_DatElemSet,e_DatMatSet,...
         m_BT(:,:,:,iElem),m_DetJT(:,iElem),m_indSTmacro_old(iElem),m_indActSTmacro_old(iElem),...
         m_injFactor_old,e_VG_Aux21,m_vectVHElem_old(:,iElem),fact_ESM,fact_inyect,fact_DGF,...
         NumsidesCutCPF(iElem),m_stressTilde_old(:,iElem),m_dissipation_old(:,iElem));
end
Once the test collapses, we can see the same log lines repeated every 10 minutes in the Java task logs produced by MATLAB. It looks as if the slave nodes (10.0.0.17 in this example; the same happens with the rest) want to send something to the client (10.0.0.44):
2016 03 11 11:08:30.162 UTC | 6 | com.mathworks.toolbox.distcomp.pmode.io.DirectCommunicationGroup.doSelect: doSelect(): fresh select
2016 03 11 11:08:35.165 UTC | 6 | com.mathworks.toolbox.distcomp.pmode.io.DirectCommunicationGroup.doSelect: doSelect(): fresh select
2016 03 11 11:08:39.503 UTC | 5 | sendTo(list) - that's 1
2016 03 11 11:08:39.504 UTC | 5 | Enqueuing a message: KeepAliveMessage to: MatlabPoolPeerInstance{fUuid=2edf2443-8d91-4a48-97a7-1fa481bafaa1, fGroupUuid=166545ca-2c15-4039-a7bf-4257d49b9745, fLabIndex=-1, fNumberOfLabs=-1}
2016 03 11 11:08:39.505 UTC | 5 | OUTGOING Permit acquisition: heap: 0(wanted: 0) direct: 0(wanted: 0)
2016 03 11 11:08:39.505 UTC | 5 | OUTGOING permits available: heap: 51200, direct: 512000
2016 03 11 11:08:39.506 UTC | 5 | Enqueued message, queue size now: 1
2016 03 11 11:08:39.506 UTC | 4 | KeepAliveSender for MatlabPoolPeerInstance{fUuid=46090aae-2e16-45a6-b3df-88575fec085d, fGroupUuid=166545ca-2c15-4039-a7bf-4257d49b9745, fLabIndex=1, fNumberOfLabs=220} sent KeepAliveMessage to [MatlabPoolPeerInstance{fUuid=2edf2443-8d91-4a48-97a7-1fa481bafaa1, fGroupUuid=166545ca-2c15-4039-a7bf-4257d49b9745, fLabIndex=-1, fNumberOfLabs=-1}]
2016 03 11 11:08:39.506 UTC | 6 | com.mathworks.toolbox.distcomp.pmode.io.DirectCommunicationGroup.drainRunnableQueue: Running addInterestOps(SelectionKey.OP_WRITE) for TransmissionChannel{fConnection=PlainConnection{fSocketChannel=java.nio.channels.SocketChannel[connected local=/10.0.0.17:39424 remote=matlab/10.0.0.44:27370], fRemoteInstance=MatlabPoolPeerInstance{fUuid=2edf2443-8d91-4a48-97a7-1fa481bafaa1, fGroupUuid=166545ca-2c15-4039-a7bf-4257d49b9745, fLabIndex=-1, fNumberOfLabs=-1}, fJoinInfo=ServerSocketConnectInfo{fSocketAddress=matlab/10.0.0.44:27370, fGroupUuid=166545ca-2c15-4039-a7bf-4257d49b9745, fSecurityDescription=ConnectorPlainSecurityDescription, fJoinTimeLimit=60000, fDeadline=9223372036854775807, fConnectAttempts=5}}} on select thread
2016 03 11 11:08:39.516 UTC | 6 | com.mathworks.toolbox.distcomp.pmode.io.DirectCommunicationGroup.doSelect: doSelect(): fresh select
2016 03 11 11:08:39.516 UTC | 6 | com.mathworks.toolbox.distcomp.pmode.io.DirectCommunicationGroup.doSelect: doSelect(): Key 1307590776 writable
2016 03 11 11:08:39.517 UTC | 6 | com.mathworks.toolbox.distcomp.pmode.io.TransmissionChannel.handleWrite: Write: 272 to: PlainConnection{fSocketChannel=java.nio.channels.SocketChannel[connected local=/10.0.0.17:39424 remote=matlab/10.0.0.44:27370], fRemoteInstance=MatlabPoolPeerInstance{fUuid=2edf2443-8d91-4a48-97a7-1fa481bafaa1, fGroupUuid=166545ca-2c15-4039-a7bf-4257d49b9745, fLabIndex=-1, fNumberOfLabs=-1}, fJoinInfo=ServerSocketConnectInfo{fSocketAddress=matlab/10.0.0.44:27370, fGroupUuid=166545ca-2c15-4039-a7bf-4257d49b9745, fSecurityDescription=ConnectorPlainSecurityDescription, fJoinTimeLimit=60000, fDeadline=9223372036854775807, fConnectAttempts=5}}
2016 03 11 11:08:39.517 UTC | 5 | OUTGOING Permit release: heap: 0 direct: 0
2016 03 11 11:08:39.518 UTC | 6 | com.mathworks.toolbox.distcomp.pmode.io.TransmissionChannel$MessageHolder.releaseAcquiredPermits: Message size: 0 Kb sent in 12 ms
2016 03 11 11:08:39.518 UTC | 6 | com.mathworks.toolbox.distcomp.pmode.io.DirectCommunicationGroup.doSelect: doSelect(): fresh select
2016 03 11 11:08:39.519 UTC | 6 | com.mathworks.toolbox.distcomp.pmode.io.DirectCommunicationGroup.doSelect: doSelect(): Key 1307590776 writable
2016 03 11 11:08:39.519 UTC | 6 | com.mathworks.toolbox.distcomp.pmode.io.TransmissionChannel.handleWrite: removeInterestOps(SelectionKey.OP_WRITE) for PlainConnection{fSocketChannel=java.nio.channels.SocketChannel[connected local=/10.0.0.17:39424 remote=matlab/10.0.0.44:27370], fRemoteInstance=MatlabPoolPeerInstance{fUuid=2edf2443-8d91-4a48-97a7-1fa481bafaa1, fGroupUuid=166545ca-2c15-4039-a7bf-4257d49b9745, fLabIndex=-1, fNumberOfLabs=-1}, fJoinInfo=ServerSocketConnectInfo{fSocketAddress=matlab/10.0.0.44:27370, fGroupUuid=166545ca-2c15-4039-a7bf-4257d49b9745, fSecurityDescription=ConnectorPlainSecurityDescription, fJoinTimeLimit=60000, fDeadline=9223372036854775807, fConnectAttempts=5}}
2016 03 11 11:08:39.519 UTC | 6 | com.mathworks.toolbox.distcomp.pmode.io.DirectCommunicationGroup.doSelect: doSelect(): fresh select
2016 03 11 11:08:44.525 UTC | 6 | com.mathworks.toolbox.distcomp.pmode.io.DirectCommunicationGroup.doSelect: doSelect(): fresh select
The systems department can see the MATLAB processes in a sleep state on the slave and client nodes, with no CPU activity but without freeing RAM. They also cannot see any error or warning lines in the system logs.
Our cluster has a 224-worker MDCS R2015b license installed and is managed with SLURM. Some cluster details are:
- Scientific Linux 6.4 as the operating system
- 18 compute nodes, all with the same specs
- A login/master node from which we start the MATLAB jobs (roughly as sketched below) and which shares a disk space over NFS
- All nodes connected through InfiniBand and Ethernet
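For reference, we open the pool from the login node roughly like this (a simplified sketch; the profile name 'mdcs_slurm' is a placeholder rather than our exact cluster profile):
parClust = parcluster('mdcs_slurm');   % MDCS cluster profile backed by SLURM (placeholder name)
pool = parpool(parClust, 220);         % 220 workers, within the 224-worker license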
I hope someone can help us with this issue.
Thanks.