Speedup of mldivide by parallelisation

Dear All,
in my applications we repeatedly use mldivide to solve linear systems, and speeding the solvers up would be very important. Since they only use 2 CPUs (on Linux, top shows less than 200% usage), I guess parallelisation would be the right technique to achieve an improvement. I found a solution online which accelerated the solve by more than a factor of 2, which would be a tremendous improvement. Unfortunately it does not work in my case. What do I need to change? Here is the example:
%%
clear all; clc; close all;
%%
delete(gcp('nocreate'));               % shut down any existing parallel pool
parpool();                             % start a fresh pool
A = distributed.rand(442195,442195);   % matrix distributed across the workers
b = sum(A, 2);
tic
x_distributed = A\b;                   % distributed solve
x = gather(x_distributed);             % collect the result on the client
toc
The only difference from the original example is the size of the variable A:
A = distributed.rand(20000,2000);
What do I need to change to make this effective again? Another solver?
Thanks Joerg

1 Comment

A dense matrix of the size you've shown would consume roughly 1.4 TB (442195² × 8 bytes). Do you actually have a system that will hold such a large matrix?


Accepted Answer

Matt J
Matt J on 14 Mar 2022
Edited: Matt J on 14 Mar 2022
mldivide is already highly optimized internally, without the need for an explicit parpool. It is not clear how distributing the columns of A across parpool workers would ever be superior. Perhaps if you were to distribute the columns of b (assuming it had multiple columns) instead of A, but even then, I would be surprised if this improved on mldivide's default parallelization. My recommendation would be to look at making A and b into gpuArrays if you want some speed-up, assuming of course that you have a good graphics card.
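A minimal sketch of that suggestion, assuming a CUDA-capable NVIDIA card (the matrix here is a small stand-in, not the 442195-square system from the question):

```matlab
% Sketch: move the data onto the GPU and let mldivide run there.
% Assumes Parallel Computing Toolbox and a supported NVIDIA GPU.
A  = rand(10000);          % stand-in for the real system matrix
b  = sum(A, 2);
Ag = gpuArray(A);          % copy operands to device memory
bg = gpuArray(b);
tic
xg = Ag \ bg;              % dense solve executes on the GPU
x  = gather(xg);           % bring the result back to host memory
toc
```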

24 Comments

Thanks. I will follow this advice. Do you have any idea why it sped things up with the smaller matrix but not the larger one?
Matt J
Matt J on 15 Mar 2022
Edited: Matt J on 15 Mar 2022
Did you actually see speed-up when you tried it yourself with the smaller matrix, or did you just see someone claim it works online?
It gave a speedup of 2 with the smaller matrix. With the larger matrix there was no speedup; the computation then took slightly more time.
Matt J
Matt J on 15 Mar 2022
Edited: Matt J on 15 Mar 2022
As I said initially, I don't see why distributing the matrix would ever have helped, even in the smaller case. However, the size of the larger matrix would probably mean that the computation would be too big to be confined to RAM, even when split across multiple processors. That probably had something to do with the difference you see. Can you not hold your matrix in sparse form?
In my computations the matrix is actually sparse, with 5193421 non-zero elements. This was missing from the example at the beginning. I am currently testing whether bicgstab helps to decrease the runtime in any way. Thanks for your help.
Using bicgstab did not help. Have to organize a GPU!
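For reference, an attempt along those lines might look like the following sketch. The ilu preconditioner and the stand-in test matrix are assumptions for illustration; the thread only mentions plain bicgstab:

```matlab
% Sketch: bicgstab on a sparse system, with an incomplete-LU preconditioner.
% gallery('poisson', k) is a stand-in sparse matrix, not the OP's system.
A = gallery('poisson', 50);                  % sparse, symmetric positive definite
b = A * ones(size(A, 1), 1);                 % right-hand side with known solution
[L, U] = ilu(A);                             % incomplete LU factors as preconditioner
[x, flag] = bicgstab(A, b, 1e-8, 500, L, U); % flag == 0 means it converged
```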
The sparse algorithms are different than the regular algorithms. At the moment I do not have any information about automatic parallelization with the sparse algorithms; the dense algorithms are already automatically run in parallel.
distributed arrays are useful only when the problem doesn't fit in the memory of a single machine, and you need to use multiple machines to store your arrays. It sounds like your problem isn't so large that you need that.
gpuArray supports sparse mldivide and the sparse iterative solvers. See this page for a list of supported functions. Generally speaking, sparse algorithms on the GPU do not get the same degree of speed-up as dense algorithms because the memory access patterns aren't as favourable.
Hi,
I have a question concerning the GPU option. The GPU stopped with an error message:
Error using \
The GPU failed to allocate memory. To continue, reset the GPU by running 'gpuDevice(1)'. If this problem persists, partition your computations into smaller pieces.
How is this possible? The GPU has several GB of memory. In this example I have two matrices A and B; each is (442195,442195) but sparse, with 5193421 non-zero elements each. This makes 5193421*8/(1024^3) = 0.0387 GB per matrix, which should fit on the GPU and leave plenty of memory to execute A\B = mldivide(A,B).
Do you understand this? I compiled the code to run it on a cluster. Could this be the reason?
Best Joerg
There are some matrices (such as banded matrices) for which efficient algorithms are known... but in the general case, if the matrix is not recognized as one of the special cases, then the calculations might end up creating a dense matrix the same size as A. In your case that could potentially be over 1 terabyte of memory.
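The back-of-the-envelope arithmetic behind that worst case, as a sketch:

```matlab
% If the factorization fills in to a dense matrix the same size as A,
% the memory needed at double precision would be:
n = 442195;
bytes = n^2 * 8;                        % 8 bytes per double element
fprintf('%.2f TB\n', bytes / 1024^4);   % roughly 1.42 TB
```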
Thanks for the update, but why does it work without such a large memory usage if I do not use gpuArray?
The matrix A is also recognized as sparse by issparse. Could it be the way I execute all of this? Here is the line of code:
C = gather(gpuArray(A)\gpuArray(B));
Where size(A) = (442195,442195) but is a sparse matrix with 5193421 non-zero elements and size(B) = (442195,1).
It works with a rather small working-memory load without the GPU.
Please let me add:
The application is compiled and runs on a cluster on a GPU with 80 GB memory.
Edric mentioned the sparse iterative solvers. Those are not necessarily the same algorithms as are available for cpu for all of the special cases.
That is unfortunate.
Please let me summarize: For a not completely understood reason, mldivide uses much more working memory on the GPU than on the CPU for the matrix that I use. Therefore, I cannot use the GPU to accelerate mldivide, so at least for now this is no option for me. We assume that the mldivide algorithm works differently on the GPU than on the CPU and possibly does not recognize my input matrix as sparse. I will close this thread with these comments.
It is not a matter of not recognizing the input matrix as sparse: it would be a matter of whether it is able to recognize it as belonging to one of the special cases for which it has implemented specific algorithms that can proceed with less memory.
For me it is not understandable why this works on the CPU but not on the GPU. Are those algorithms not available on the GPU?
How can I determine which algorithm is to be used? I guess this will change during the execution of my code, since the elements in C change.
Help is highly appreciated.
Matt J
Matt J on 17 Mar 2022
Edited: Matt J on 17 Mar 2022
Even in cases where it works, it is usually slower for sparse matrices on the GPU, as @Edric Ellis pointed out earlier. I did not know your matrix was sparse when I recommended a GPU-based solution (though perhaps I should have guessed, based on its gigantic size).
Sparse direct solve is done by sparse LU or sparse QR, both of which generate dense factors. On the CPU the factorization and the solve can be done sequentially without creating huge intermediate matrices, but on the GPU this would completely negate any benefit of vectorized mathematics.
So unless your matrix is some sort of special case which doesn't generate dense factors, on the GPU you should use an iterative solver. In general, these are faster than the CPU solvers, both direct and iterative.
The general advice is always to use iterative solvers for large sparse matrices. They are always faster as long as they are well enough conditioned or have good preconditioners.
As to which solver to choose...well I just try them all and see which ones work, but I'm sure there's a more scientific method...
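The "try them all" approach can be sketched as a loop over the solvers that share the same calling convention (the solver list and the stand-in matrix are illustrative assumptions):

```matlab
% Sketch: time several sparse iterative solvers on the same system and
% compare. All four accept the (A, b, tol, maxit) calling form.
A = gallery('poisson', 100);          % stand-in sparse SPD matrix
b = A * ones(size(A, 1), 1);
solvers = {@pcg, @minres, @bicgstab, @cgs};
for k = 1:numel(solvers)
    tic
    [x, flag] = solvers{k}(A, b, 1e-8, 1000);
    t = toc;
    % flag == 0 indicates convergence to the requested tolerance
    fprintf('%-8s flag=%d  %.3f s\n', func2str(solvers{k}), flag, t);
end
```

Note that gmres is omitted here only because its third argument is the restart parameter rather than the tolerance, so it doesn't fit the uniform call.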
"Sparse direct solve is done by sparse LU or sparse QR, both of which generate dense factors. On the CPU the factorization and the solve can be done sequentially without creating huge intermediate matrices, but on the GPU this would completely negate any benefit of vectorized mathematics."
Interesting. Does that mean the sparse solver, direct or iterative, is essentially single-threaded?
No, it doesn't mean that. It's just about the balance of data vs task parallelism. The GPU needs huge data parallelism whereas the CPU can take advantage of task parallelism. A highly data parallel sparse direct solver needs memory.
The iterative solvers are fine, they parallelise well on GPU and CPU.
I"m confuse about the way sparse direct solver works on CPU vs GPU.
You seems to say on GPU the number of parallel tasks are large so it implies memory saturation (in case of OP), and on CPU the "On the CPU the factorization and the solve can be done sequentially without creating huge intermediate matrices".""
"Sequentiallly" means no parallélism. Or do you means something else by "Sequeltially"?
I know that matrix factorization (QR or LU) the current step depends the previous step, and essentially i can not be transposed to parallelism, unles done on the inner iteration. Thus I interpret your claim like this.
But never mind if it's too long to develop here how sirect sover on sparse internally works.
The answer depends on how interested you are in Numerical Methods and Analysis.
GPU parallelism is data parallelism. Simplistically, I want to put all cores to work factorizing the matrix to get the best speedup. CPU parallelism is more flexible. Most CPUs have a vector processing unit on which some vector operations (e.g. 128-bit or 256-bit) can be performed, but the physical cores are mostly independent and can happily be doing different things on small amounts of data.
I've never written this algorithm myself, and maybe an academic can step in and expand for me. But I know on paper how the algorithm works, and it seems likely that as I'm computing the factors I can be immediately using the results to produce partial solutions to the equation, which I can update continuously as I compute the factors. On the CPU this could be done efficiently by giving different cores different chunks of the problem, whereas on a GPU this would no doubt be inefficient with memory and resources. The GPU is much better utilized when the fewest kernels are running on the most data, and very badly utilized if you need those kernels to communicate as we would in this case.
By "sequentially" I am, as you surmise, referring to the dependency between the different steps of the algorithm. This is kind of orthogonal to the above discussion. I'm just trying to say, in simple terms, that where results have to be computed in sequence, a CPU can still be well utilized while a GPU will generally struggle because the raw FLOPs of a single core are not good.
I hope that helps...but perhaps it doesn't really matter. If you really want to know how these algorithms are computed I could recommend some books if you like.
The number of simultaneous kernels you can run has varied a fair bit over time, and can depend on the exact model. Some models offer more control modules that control fewer cores, with each control module potentially being independent. And whether you are using single precision or double precision can make important differences in conjunction with the exact model.
So, potentially, if you picked out just the right NVIDIA model, you might be able to do well on sparse computations, but on many (most) NVIDIA models it could be pretty inefficient.


More Answers (0)

Products

Release

R2022a
