Why is rand() causing a crash in parallel batch jobs?
3 views (last 30 days)
Hi,
We have a user on our cluster who keeps getting crashes when he attempts to run his MATLAB job as a parallel batch job. I've narrowed the problem down to the rand() function.
If I do the following interactively from the GUI, all is well:
n = 10000
A = rand(n,n)
This should yield a matrix with 100 million elements (about 800 MB as doubles). Now, if I add those two lines to 'test.m' and then pass it to a batch job like so:
job = batch('test')
the job runs successfully, but if I attempt to run a parallel batch job (with the Parallel Computing Toolbox) then I get the following:
*****
>> job = batch('test', 'matlabpool', 2)
...
>> wait(job)
>> load(job)
Warning: Variable 'errormessage' not found.
> In distcomp.fileserializer.getFields>iLoadMat at 92
In distcomp.fileserializer.getFields at 70
In distcomp.fileserializer.getField at 11
In distcomp.abstracttask.pGetErrorMessage at 15
In distcomp.abstractjob.load at 51
Warning: Variable 'errormessage' not found.
> In distcomp.fileserializer.getFields>iLoadMat at 92
In distcomp.fileserializer.getFields at 70
In distcomp.fileserializer.getField at 11
In distcomp.abstracttask.pGetErrorMessage at 15
In distcomp.abstractjob.load at 51
Error using distcomp.abstractjob/load (line 54)
The job failed to run correctly but no error message was returned.
This could be because the scheduler failed to start MATLAB correctly on
the cluster, or because the files needed to run the batch job were
unavailable to the MATLAB on the cluster. You may find more information
in the debug log for this job. To find out more about debug logs look at
the getDebugLog function in the documentation.
*****
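The error text points at getDebugLog, so for reference, this is roughly how that log gets pulled on the R2011b-era interface (a sketch; the 'local' configuration name is an assumption about our default setup, and whether getDebugLog is available depends on the scheduler type):
sched = findResource('scheduler', 'configuration', 'local');   % handle to the scheduler the job ran on
getDebugLog(sched, job)                                         % scheduler-level output for the failed job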
I hope you'll forgive the example. It's obviously not useful to run the program above on several processes in parallel; it's just a test case.
The job obviously crashes, but MATLAB doesn't have anything useful to say, as if it wasn't able to handle the error itself. The output from getDebugLog, as the message suggested, didn't yield anything useful either. This led me to believe that perhaps a limit was hit in the JVM, and thus that the crash is happening outside of MATLAB.
I read in the troubleshooting section of the manual that the maximum object size the JVM allows is 2 GB, so I connected to one of the processes via VisualVM to watch the heap, and it doesn't seem to get over 100 MB. It's possible I'm looking in the wrong place, or at the wrong time.
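As a cross-check on the numbers, the heap limits of the embedded JVM can also be queried from inside a MATLAB session itself, which should match what VisualVM reports (a sketch using the standard java.lang.Runtime calls):
rt = java.lang.Runtime.getRuntime();
maxHeapMB = rt.maxMemory() / 2^20                        % configured maximum heap, in MB
usedHeapMB = (rt.totalMemory() - rt.freeMemory()) / 2^20 % heap currently in use, in MB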
So I'm at a loss as to how to troubleshoot this further, or at least understand why it's happening.
Is it an error, or a memory limit? Can anyone shed light on why this is happening, how to solve it, or explain what limit is being hit?
Thanks in advance for any help.
0 Comments
Answers (1)
Jason Ross on 27 Nov 2012
Edited: Jason Ross on 27 Nov 2012
- Are there any dependent files that you didn't attach? From your example, it looks like the answer is "no", but your real code may depend on a function that isn't included or can't be found, even if it's a trivial function.
- Is it possible you are shadowing something with your function names?
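A couple of quick ways to check both of those points from the MATLAB prompt (a sketch; 'helper.m' is just a placeholder for whatever extra files your real code needs):
which -all rand          % lists everything named rand on the path; a user file ahead of the built-in means it is shadowed
which -all test          % same check for your own script and function names
job = batch('test', 'matlabpool', 2, 'FileDependencies', {'helper.m'});   % ship extra files with the job explicitly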
6 Comments
Jason Ross
on 29 Nov 2012
Yes, the "local" cluster is what I was referring to.
I looked further into it. I found the following:
- In 2011b, I can reproduce the issue using your example. Filename does not matter at all.
- In 2012a and 2012b, it works as you would expect. The job runs and you can load the results without incident.
I was able to get a bit more information by using getAllOutputArguments(job), but it didn't help in solving the issue. It looks like the problem you were seeing was fixed in 2012a, though.
I don't know about your newline problem. Perhaps clearing a browser cache would help?
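For anyone reproducing this on R2011b, the per-task inspection Jason mentions looks roughly like this on the pre-R2012a distcomp interface (a sketch; the task may simply have no error recorded, which would match the warnings above):
out = getAllOutputArguments(job)      % per-task outputs; typically empty when the job failed
tasks = get(job, 'Tasks');            % the job's task objects
get(tasks(1), 'ErrorMessage')         % error text for the first task, if any was recorded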
Jan
on 29 Nov 2012
In fact, you need two newlines to start a new paragraph. A single newline does not wrap the text.