
GPU utilization and parallel computation with MATLAB for heavy computation

I have a decent machine with a Core i7 (8 cores), 32 GB of RAM, and an Nvidia GeForce GTX 1080i, running MATLAB 2018b. At the moment I am a bit confused about how best to use these resources to run my Monte Carlo simulation code. I have two questions:
1- How can I make all the heavy computation run on the GPU, alongside MATLAB's parallel computing capability, rather than on the CPU, so that I can decide which is best to use? I have read various help topics, and my conclusion is that the data I work with should be in the form of a gpuArray. Am I right, or am I missing something? Let us assume that I have the following simple code to run on the GPU:
First_Vector=zeros(2,3);
% First_Vector=zeros(2,3,'gpuArray'); 1
[N,M]=size(First_Vector);
%[N,M]=size(First_Vector,'gpuArray'); 2
Second_Matrix=ones(N,M,2);
%Second_Matrix=ones(N,M,2,'gpuArray'); 3
Test1= [20 20 20;30 30 30];
%Test1=gpuArray(Test1); 4
Test2= [50 50 50;60 60 60];
%Test2=gpuArray(Test2); 5
K=100;
% the main code
for i=1:2
    for j=1:3
        element=Function1(Test1(i,j),K);
        Test1(i,j)=element;
    end
end
Second_Matrix(:,:,1)=Test1;
[Test1]=Function2(Test1,Test2);
% End of the main code
%% Function 1
function[outcome]=Function1(A,K)
outcome=A+K;
end
%%Function 2
function[T1]=Function2(T1,T2)
T1=T1+T2;
end
Are the commented lines (1-5) enough to run the 'main code' on the GPU?
2- I tested the following simple code on the GPU and the CPU, and CPU performance was far better than GPU performance. Is that normal?
Thanks in advance.
G = ones(10,10,'gpuArray');
tic
for k=1:100
for i=1: 1000
for j=1:10
G(j,:)=G(j,:)+2;
end
end
end
toc
G = ones(10,10);
tic
for k=1:100
for i=1: 1000
for j=1:10
G(j,:)=G(j,:)+2;
end
end
end
toc
% Elapsed time is 0.628241 seconds.

Accepted Answer

Andrea Picciau on 3 Jul 2019
Edited: Andrea Picciau on 3 Jul 2019
I'll try to answer your questions in order...
  1. Yes! Isn't that great?
  2. Yes, because there are two problems with your code: (a) you're using a lot of for loops instead of vector operations and (b) you're measuring GPU performance incorrectly. To fix (a), you should read this doc page that explains how to vectorize your code to get the best performance. To fix (b), you should take a look at my answer to a previous question and use the functions timeit and gputimeit.
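To illustrate point (b), here is a minimal sketch of why plain tic/toc can mislead on the GPU. It assumes Parallel Computing Toolbox and a supported GPU are available; the array size and the variable names (gpuData, tManual, tGpu) are made up for illustration. GPU operations may return control to MATLAB before they have finished, so the clock has to wait for the device, which gputimeit takes care of for you.

```matlab
% Assumed test data: a 1000x1000 array created directly on the GPU.
gpuData = ones(1000, 1000, 'gpuArray');

% Manual timing: synchronise with the device before starting and
% before stopping the clock, otherwise toc may fire too early.
dev = gpuDevice;            % handle to the currently selected GPU
wait(dev);                  % make sure earlier GPU work has finished
tic
result = gpuData + 2;       % the operation being measured
wait(dev);                  % block until the GPU has finished
tManual = toc;

% Recommended timing: gputimeit handles synchronisation and repeats
% the call several times to get a more reliable figure.
tGpu = gputimeit(@() gpuData + 2);

fprintf('tic/toc with wait: %g s, gputimeit: %g s\n', tManual, tGpu);
```

The two numbers will not match exactly, since gputimeit repeats the call while the manual version times a single run.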
  3 Comments
Andrea Picciau on 9 Jul 2019
A quick comment about what I meant by vectorising your code: I was looking at this bit here
for k = 1:100
    for i = 1:1000
        for j = 1:10
            G(j,:) = G(j,:) + 2;
        end
    end
end
which could really be written as
G = G + 200000;
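As a quick sanity check of that equivalence, the sketch below runs both versions on the CPU and compares the results. The names Gloop and Gvec are introduced here only for illustration. Each element receives +2 once per (k, i) pair, that is 100 × 1000 times, which gives the 200000 above.

```matlab
G = ones(10, 10);

% Original triple loop: every row gets +2 for each (k, i) pair.
Gloop = G;
for k = 1:100
    for i = 1:1000
        for j = 1:10
            Gloop(j,:) = Gloop(j,:) + 2;
        end
    end
end

% Vectorised equivalent: 2 * 100 * 1000 = 200000 added in one step.
Gvec = G + 2 * 100 * 1000;

disp(isequal(Gloop, Gvec))  % both versions hold the same values
```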
I imagined you were just trying to benchmark the same operation executed on a for loop, so I wrote a quick script for that. I benchmarked three versions of the same algorithm:
  • the fully vectorised version,
  • a for loop with some vectorisation,
  • a for loop without vectorisation.
My GPU is a Tesla K40c and my processor is an Intel Xeon E5-1650.
Let me show you my script:
numRows = 1000;
cpuData = ones(numRows, numRows);
gpuData = gpuArray(cpuData);
timeit(@() iVectorised(cpuData), 1) % 0.0030 seconds
gputimeit(@() iVectorised(gpuData), 1) % 9.3611e-05 seconds
timeit(@() iForLoop(cpuData), 1) % 0.0145 seconds
gputimeit(@() iForLoop(gpuData), 1) % 0.0011 seconds
timeit(@() iForLoopWithIndexing(cpuData, numRows), 1) % 0.2310 seconds
gputimeit(@() iForLoopWithIndexing(gpuData, numRows), 1) % 12.6261 seconds
%% HELPER FUNCTIONS
function dataOut = iVectorised(dataIn)
    % Completely vectorised
    dataOut = dataIn + 200;
end
function dataOut = iForLoop(dataIn)
    % Partially vectorised, external for loop remains
    for k = 1:100
        dataIn = dataIn + 2;
    end
    dataOut = dataIn;
end
function dataOut = iForLoopWithIndexing(dataIn, numRows)
    % Completely non-vectorised, uses indexing
    for k = 1:100
        for i = 1:numRows
            dataIn(i,:) = dataIn(i,:) + 2;
        end
    end
    dataOut = dataIn;
end
What you're observing is the last case (for loop without vectorisation). The reason it takes so long on the GPU is that indexing gpuArrays is very expensive. For example, you are:
  • moving the index i to the GPU,
  • creating a temporary gpuArray,
  • writing dataIn(i,:) to this temporary GPU array. To do this, you have to index dataIn by row rather than by column (indexing by column is usually faster),
  • scheduling dataIn(i,:) + 2 on the GPU,
  • assigning the output of this operation back to the right elements of dataIn.
To do most of these things, you need to communicate back and forth between the CPU and the GPU, which is going to affect your performance (note: this is true for any GPU code, not just if you're using MATLAB). The vectorised version is highly optimised to avoid this ping-pong between your GPU and your CPU.
You also might want to consider larger problems. For example, the data in my script is 1000x1000, which is a reasonable size to start thinking about GPU acceleration.
Putting it all together, I would apply these two golden rules to your Monte Carlo code:
  • Reason in matrix and vector operations, not for loops. Vectorise, vectorise, vectorise.
  • Think about your overheads. Is the extra communication time worth spending? Should you use a parallel pool or a GPU?
Optimising parallel applications can be a difficult problem, but the rewards can be very high!
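To make the first rule concrete, here is a minimal sketch of a vectorised Monte Carlo computation. It is not the asker's simulation: the problem (estimating pi from random points in the unit square), the sample count, and all variable names are assumptions chosen for illustration. It assumes Parallel Computing Toolbox is installed and falls back to the CPU when no GPU is found.

```matlab
numSamples = 1e6;               % assumed sample count
useGpu = gpuDeviceCount > 0;    % requires Parallel Computing Toolbox

% Draw all samples at once instead of looping over trials.
if useGpu
    x = rand(numSamples, 1, 'gpuArray');
    y = rand(numSamples, 1, 'gpuArray');
else
    x = rand(numSamples, 1);
    y = rand(numSamples, 1);
end

% One vectorised test replaces a for loop over individual trials.
insideCircle = (x.^2 + y.^2) <= 1;

% Reduce on the device, then bring a single scalar back to the CPU.
piEstimate = gather(4 * sum(insideCircle) / numSamples);

fprintf('Estimated pi: %.4f\n', piEstimate);
```

Only one scalar crosses from the GPU to the CPU at the end, which keeps the communication overhead from the second rule to a minimum.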
caesar on 9 Jul 2019
Thanks again, Andrea Picciau. Your example is very detailed and clear to me. As I mentioned before, I think there is no way for me to vectorise my code, especially since it is very complicated and calls different functions iteratively.
The annoying thing now is that I can't exploit the resources I have (a powerful PC and a cluster) to the max.
Regards
