Can LU decomposition use more than 12 cores in a Desktop?

Hi,
I have to use LU decomposition for a very large matrix. LU makes use of multicores when they are present automatically. Will it use more than 12 if they are available?
Thanks

Answers (2)

lu takes advantage of MATLAB's built-in multithreading, and that extends beyond 12 cores.
The 12-core limit you are probably thinking of is the maximum number of independent workers that Parallel Computing Toolbox allows you to launch on your local machine.
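A quick way to see the difference for yourself (an illustrative sketch; the matrix size is arbitrary):

```matlab
% Built-in multithreading: lu() automatically uses all of the
% computational threads MATLAB detects, with no toolbox needed.
nThreads = maxNumCompThreads   % how many threads lu() can use

A = rand(4000);
tic; [L, U, P] = lu(A); toc    % multithreaded automatically

% By contrast, the Parallel Computing Toolbox worker pool is what
% was capped at 12 on a local machine:
% parpool('local', 12)
```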

8 Comments

Thanks for the fast answer. How does the speed increase with the number of cores?
If you had to multiply two random matrices, rand(10000) for example, would the speed of the calculation increase linearly with the number of cores?
And what about solving Ax = y by LU decomposition? How would it scale with the number of cores?
You can prove trivially that it does not scale linearly with the number of cores.
Suppose multiplying two matrices of a known size takes K arithmetic instructions. If speed scaled linearly with the number of cores, then with K cores the work would have to be distributed evenly over all of them, one arithmetic instruction each. Hypothesize that it could be done that way. Now add one more core: to maintain linear scaling, K arithmetic instructions would have to be distributed evenly among K+1 cores. But arithmetic instructions are indivisible, so that cannot happen. By contradiction, performance cannot keep increasing linearly with the number of cores.
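You can also just measure it. A rough experiment (a sketch, using a smaller matrix than your use case) that restricts MATLAB to one thread and compares against the default:

```matlab
% Compare one thread against all available threads for a matrix
% multiply; the measured speedup is typically well below the
% number of physical cores.
A = rand(4000); B = rand(4000);

old = maxNumCompThreads(1);    % restrict to 1 thread; returns old count
tic; C = A*B; t1 = toc;

maxNumCompThreads(old);        % restore the default thread count
tic; C = A*B; tN = toc;

fprintf('Speedup with %d threads: %.1fx\n', old, t1/tN);
```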
There are other technical reasons why performance cannot increase linearly, and they would stop you far sooner. Instructions and data have to be distributed to each worker, and each transfer takes finite time. At some number of cores you exhaust the available memory bandwidth and can no longer put additional simultaneous transfers into flight; beyond that point, performance cannot increase linearly.
You can refine this model into serialization time ("marshalling data") plus work-transfer and data-transfer time (these are independent of how many loop iterations the block does, and probably only a few such transfers can overlap on the memory bus); then execution time per core (probably independent of the other cores); then time to marshal the results and transfer them back (again, probably only a few back-transfers can happen at the same time). If you work it through, you find that best performance comes from giving each core enough work to make the overheads worthwhile; with too many cores you end up waiting on infrastructure instead of waiting on the cores to execute.
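As a toy illustration of that trade-off (the numbers here are made-up assumptions, not benchmarks): model total time as the parallel work plus a fixed per-core overhead, and the speedup curve peaks and then falls off.

```matlab
% Toy cost model: work W split over p cores, plus per-core
% distribution/marshalling overhead h. Speedup saturates, then
% degrades once overhead dominates.
W = 1;                    % total compute time on one core (normalized)
h = 0.01;                 % assumed per-core overhead (made-up value)
p = 1:64;
T = W ./ p + h .* p;      % crude model: parallel work + overhead
speedup = W ./ T;
[~, best] = max(speedup);
fprintf('Best core count under this model: %d\n', p(best));
```

Under this particular model the optimum is near sqrt(W/h) cores; past that, adding cores makes things slower.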
Would GPUs do better?
Salvador, you have to take into account that there will be data-transfer overhead if you want to scale computation by parallelizing. This overhead is far more significant when it comes to GPUs.
What is the end goal? Is this for academic purposes or for a real-world application? If it is for an actual application there are, as Walter mentioned already, several other factors that need to be considered, including your infrastructure. If you can elaborate on your use case, we will be able to point you to the tools best suited to attack the problem.
Because the transfer times are longer for GPUs, you have to give each GPU more work to make it worthwhile compared to the alternatives. As you increase the number of GPUs, you eventually reach a point where you cannot load all of them within twice the time each GPU takes to execute its task -- the point at which it would have been more efficient to give a task twice as long to half as many GPUs (avoiding about half of the loading and about half of the unloading).
Besides, a worker process cannot use more than one GPU at a time (though a GPU can sometimes be shared between workers). With heat budgets and the like, it is hard to fit more than two GPUs in a standard case.
Perhaps your question was about switching to a single GPU instead of parallel cores?
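If so, moving the factorization to one GPU is straightforward, assuming Parallel Computing Toolbox and a supported NVIDIA GPU (a sketch; matrix size is arbitrary and the transfers are exactly the overhead discussed above):

```matlab
% Offload LU factorization and the triangular solves to the GPU.
A = rand(8000); y = rand(8000, 1);

Ag = gpuArray(A);              % one host-to-device transfer (the overhead)
tic; [L, U, P] = lu(Ag); wait(gpuDevice); toc

x = U \ (L \ (P * gpuArray(y)));   % solve P*A*x = P*y on the GPU
x = gather(x);                 % one device-to-host transfer back
```

The key point is that the two `gpuArray`/`gather` transfers are fixed costs, so the GPU only pays off when the factorization itself is large enough to dominate them.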
Earlier I wrote about simultaneous transfers for distributing instructions and data to workers. What I omitted to mention is that some cluster systems designed for high performance can send the same data to multiple core-groups simultaneously, with no increase in time, provided the same information is going to each. The only such systems I can name at the moment are the blade servers built by SGI; I don't think Parallel Computing Toolbox is designed to take advantage of those facilities (though you might be able to make calls through the standard message-passing libraries to get it to work). There are some kinds of problems for which that kind of facility is critical for high performance, but the majority of parallel tasks can instead be implemented in terms of tasks that run independently.

Sign in to comment.

Thanks for the answers. The final goal is solving an inverse problem in medical imaging. What would be the best solution for these two types of problems, matrix multiplication and linear-equation solving (by LU), assuming a maximum budget of $20,000? The matrices are very large (10 GB) and not sparse.


Asked on 28 Aug 2013

