Discarded Messages with SPMD and labReceive ... why?

Hello,
I am using SPMD and trying to get some workers communicating with each other. There is a flag they need to send and receive. Whichever worker gets its job done and committed first sends out the flag, which the remaining workers should receive and therefore not commit their work.
Here is some abstract code that hopefully gets across what I am trying to do. I would have thought the labBarrier at the bottom would ensure that all workers finishing in second place or later receive the flag from the first worker to finish. Some do, but I also get many warning messages similar to the following:
Lab 1:
Warning: An incoming message was discarded from lab 2 (tag: 2)
Some workers are indeed missing the message, even if they finish seconds after the flag was sent out.
How does labSend work? Am I missing something here?
----------------------------
% Emulating workers doing some variable time task
pause(randi([1 15]));
% See if other workers got there first and sent an update
for i=1:1:length(agentVec)
    if i==labindex
        Updates(i)=0;
    else
        if labProbe(i,2)
            [Updates(i),srcWkrIdx,tag] = labReceive(i,2);
        else
            Updates(i) = 0;
        end
    end
end
if ~any(Updates)
    % Commit work
    flag = 1
else
    % Otherwise take a nap
    flag = 0
end
labSend(flag,agentVec(agentVec ~= labindex),2);
labBarrier;
  3 Comments
Edric Ellis on 20 Jul 2022
I would actually say exactly the opposite - MPI is (generally) very reliable and predictable. I shall post an answer with a suggestion as to how you might proceed.
In the code that you've written, each worker is guaranteed to labSend to each other worker. However, each worker is not guaranteed to labReceive from each other worker. There are guaranteed to be mismatched send/receives.
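To illustrate what matched sends and receives look like, here is a minimal sketch in which every send is paired with exactly one receive. It assumes labSendReceive is acceptable for the exchange and uses a hypothetical payload (10*labindex) in place of the real flag. The ring-shift schedule is just one common way to pair workers up, not necessarily the right design for the original problem.

```matlab
spmd
    tag = 2;                         % same tag as in the question
    myFlag = 10 * labindex;          % hypothetical payload standing in for the real flag
    received = zeros(1, numlabs);    % received(k) will hold the value sent by worker k
    received(labindex) = myFlag;

    % Ring-shift schedule: at step s, send to the worker s places ahead and
    % receive from the worker s places behind. Every send has exactly one
    % matching receive, so no message is left unreceived when the block ends.
    for s = 1:numlabs-1
        dest = mod(labindex - 1 + s, numlabs) + 1;
        src  = mod(labindex - 1 - s, numlabs) + 1;
        received(src) = labSendReceive(dest, src, myFlag, tag);
    end
end
```

Note that this exchange synchronizes the workers: each one waits for its partner at every step, so it does not give a worker that finishes early a way to carry on without the others.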


Answers (1)

Edric Ellis on 20 Jul 2022
Using conditional receives in this way is not a robust way to get the workers to collaborate - you have an ordering problem that cannot be solved. I think you can probably achieve your goal by using one of the "reduction" functions which are designed to collect together results from multiple workers. In particular, you could try gcat to allow each worker to find out what happened on every other worker. gcat (effectively) collects values from all workers and concatenates them together on each worker. In this way, you don't need the labBarrier call either. Something a bit like this:
myResult = doSomeWork();
allResults = gcat(myResult);
% Now, choose what to do based on the results from all workers.
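As a slightly fuller sketch of the same idea, each worker could share its completion time and every worker could then agree on who finished first. The names myFinish, allFinish, and winner are hypothetical, the pause stands in for the real task, and the comparison assumes the worker clocks are synchronized closely enough for the timestamps to be comparable, which may not hold across machines.

```matlab
spmd
    pause(randi([1 5]));                     % stand-in for the real task, as in the question
    myFinish = posixtime(datetime('now'));   % this worker's completion time in seconds
    allFinish = gcat(myFinish);              % 1-by-numlabs row, identical on every worker
    [~, winner] = min(allFinish);            % earliest finisher; ties go to the lowest index
    flag = (labindex == winner);             % true only on the worker that should commit
end
```

Because every worker sees the same allFinish vector, they all reach the same decision without any conditional receives. The cost is that gcat does not return until every worker has called it.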
  1 Comment
EvanThomas on 20 Jul 2022
Edited: EvanThomas on 20 Jul 2022
Thanks again for the feedback. Unfortunately, I'm not sure that will work for me, as it looks like SPMD waits until it hears back from all workers, which makes sense given that it is concatenating all their responses and they are running asynchronously. So the function containing gcat won't complete until all workers are done, if I understand things correctly. This takes me away from the asynchronous behavior I need at the next step.
For example, whenever Agent A is done, it needs data only from the workers that have finished up to that point. So I would need a "partial" gcat, or some way to concatenate results from only the subset of workers that finished before Agent A. I'm not sure that is possible, though. Hopefully my description makes sense.
I felt like labSend and labReceive would be the only way to accomplish this. Unfortunately, that is not working, either.
