Can LU decomposition use more than 12 cores in a Desktop?
Hi,
I have to compute an LU decomposition of a very large matrix. LU automatically makes use of multiple cores when they are present. Will it use more than 12 cores if they are available?
Thanks
Answers (2)
Shashank Prasanna
on 28 Aug 2013
1 vote
LU takes advantage of MATLAB's built-in multithreading, and that extends beyond 12 cores.
The 12-core limit you are probably referring to is the limit on the number of independent workers that Parallel Computing Toolbox allows you to launch on your local machine.
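The same distinction exists outside MATLAB, which makes it easy to illustrate. Below is a hypothetical Python/SciPy sketch (not the poster's code): the LAPACK-backed LU factorization runs on however many threads the underlying BLAS/LAPACK build is configured to use (often controlled by the `OMP_NUM_THREADS` environment variable), independent of any worker-pool limit.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

# lu_factor calls LAPACK's getrf, which the BLAS library parallelizes
# across its configured thread count -- no explicit worker pool needed.
lu, piv = lu_factor(A)
x = lu_solve((lu, piv), b)

print(np.allclose(A @ x, b))  # prints True
```

The point of the sketch: the factorization itself is multithreaded by the math library, which is a separate mechanism from launching explicit parallel workers.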
8 Comments
Salvador Sanchez
on 28 Aug 2013
Shashank Prasanna
on 28 Aug 2013
Could you elaborate?
Salvador Sanchez
on 28 Aug 2013
Edited: Salvador Sanchez
on 28 Aug 2013
Salvador Sanchez
on 28 Aug 2013
Walter Roberson
on 28 Aug 2013
You can prove trivially that performance does not scale linearly with the number of cores.
Suppose multiplying two matrices of known size takes K arithmetic instructions. If speed scaled linearly with the number of cores, then with K cores the work would have to be distributed over all of the cores, one arithmetic instruction each. Hypothesize that it could be done that way. Now add one more core. To maintain linear scaling, K arithmetic instructions would have to be evenly distributed amongst K+1 cores. But arithmetic instructions are indivisible, so that cannot happen. Therefore, by way of contradiction, performance cannot increase linearly in the number of cores.
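The indivisibility argument can be sketched numerically. This is a toy Python model (the value of `K` and the core counts are made up for illustration; `parallel_steps` is not a real API):

```python
import math

K = 12  # total arithmetic instructions in the (hypothetical) task

def parallel_steps(K, cores):
    # Instructions are indivisible, so the busiest core executes
    # ceil(K / cores) of them; that bounds the wall-clock time.
    return math.ceil(K / cores)

for c in (1, 12, 13, 24):
    print(c, parallel_steps(K, c))
# prints: 1 12 / 12 1 / 13 1 / 24 1
# With 12 cores each core does 1 instruction; adding a 13th core cannot
# reduce that below 1 step, so speedup saturates at 12x here.
```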
There are other technical reasons why performance cannot increase linearly, and they would bite far sooner. Instructions and memory have to be distributed to each worker, each of which takes finite time. At some number of cores you must exceed the available memory bandwidth, unable to put more simultaneous transfers into execution. Beyond that point, performance cannot increase linearly.
You can refine this model into serialization time ("marshalling data"), plus work-transfer time and data-transfer time (these times are independent of the number of loops the block should do, and probably only a few of these transfers can overlap on the memory lines); then execution time per core (probably independent of the other cores); and then time to marshall the results and time to transfer the results back (probably only a few back-transfers can happen at the same time). If you work it through, you find that best performance comes from having each core do enough work to make the overheads worthwhile; if you have too many cores, you end up waiting on infrastructure instead of waiting on cores to execute.
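A toy version of this cost model shows the sweet spot described above. All constants here are invented for illustration, and `total_time` is a hypothetical helper, not a real API:

```python
def total_time(work, cores, marshal=1.0, transfer=0.5, gather=0.5):
    # Fixed per-core overheads (serializing, shipping work and data,
    # collecting results) grow with the core count, while the compute
    # share shrinks as work is divided among more cores.
    overhead = cores * (marshal + transfer + gather)
    return overhead + work / cores

work = 1000.0
best = min(range(1, 129), key=lambda c: total_time(work, c))
print(best)  # prints 22 -- the sweet spot for these made-up overheads
```

Past that point, each extra core adds more overhead than it removes compute time, which is exactly the "waiting on infrastructure" regime.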
Salvador Sanchez
on 28 Aug 2013
Shashank Prasanna
on 28 Aug 2013
Edited: Shashank Prasanna
on 28 Aug 2013
Salvador, you have to take into account that there will be data-transfer overhead if you want to scale computation by parallelizing it. This overhead is far more significant when it comes to GPUs.
What is the end goal? Is this for academic purposes or for a real-world application? If this is indeed for an actual application there are, as Walter mentioned already, several other factors that need to be considered, including your infrastructure. If you can elaborate on your use case, we will be able to tell you which tools are available to best attack the problem.
Walter Roberson
on 28 Aug 2013
Because the transfer times are longer when it comes to GPUs, you have to give the GPUs more work to make them worthwhile compared to the alternatives. As you increase the number of GPUs, there comes a point at which you can no longer load all of them in less than twice the time each GPU takes to execute its task -- the point at which it would have been more efficient to give tasks twice as long to half as many GPUs (and so avoid about half of the loading and about half of the unloading).
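That break-even can be sketched with a toy model (hypothetical numbers; it assumes the host loads GPUs one at a time, so loading serializes while compute parallelizes):

```python
def job_time(work, n_gpus, load=1.0):
    # Loading is serialized across devices: n_gpus * load.
    # The compute then runs in parallel: work / n_gpus.
    return n_gpus * load + work / n_gpus

work = 100.0
n_best = min(range(1, 65), key=lambda n: job_time(work, n))
print(n_best)  # prints 10: past ~sqrt(work/load) GPUs, adding more is slower
```

With these made-up constants, doubling from 10 to 20 GPUs makes the job slower, because the extra loading time outweighs the saved compute time.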
Besides, you cannot attach more than one GPU per worker process (though a GPU can sometimes be shared between workers). With heat budgets and the like, you are pushing it to put more than 2 GPUs in a single standard case.
Perhaps your question was about switching to a single GPU instead of parallel cores?
Earlier I wrote about simultaneous transfers for distributing instructions and data to workers. What I omitted to mention is that some cluster systems designed for high performance can send the same data to multiple core-groups simultaneously, without increased time, provided the same information is going to each. The only such systems I can name at the moment are the blade servers built by SGI; I don't think Parallel Computing Toolbox is designed to take advantage of those facilities (but you might be able to get it to work by making calls through the standard message-passing libraries). There are some kinds of problems for which that kind of facility is critical for high performance, but the majority of parallel tasks can instead be implemented in terms of tasks that run independently.
Salvador Sanchez
on 29 Aug 2013
0 votes