Running Code on GPU Seems much Slower than Doing so on CPU

Question

Theron FARRELL on 27 May 2019

https://it.mathworks.com/matlabcentral/answers/464145-running-code-on-gpu-seems-much-slower-than-doing-so-on-cpu

Moved: Walter Roberson on 27 Oct 2024 at 18:53

Hi there,

I am using a ThinkPad W550 with a Quadro K620M GPU. When I ran the following code, the profiler showed that the GPU version was much slower.

function Test_GPU()
a = [10^8, 18^8];
h = a;
c = conv2(h, a, 'full');
% Running in double precision gave a similar result
aa = single(gpuArray([10^8, 18^8]));
hh = aa;
cc = conv2(hh, aa, 'full');
end

So I ran the official gpuBench()

http://www.mathworks.com/matlabcentral/fileexchange/34080-gpubench

The result is astonishing! Running on the GPU is slower, much, much slower.

The first picture shows the result from GPU, and the second, CPU.

I will be very grateful if anyone could tell me why. Many thanks

2 Comments

Theron FARRELL on 27 May 2019

A follow-up question: after gpuBench finished running, no HTML report was generated.

Could this be related to browser settings or something similar?

Jan on 27 May 2019

a = [10^8, 18^8] is a [1x2] vector. For a speed comparison, this job is too tiny.

Answer 1

Andrea Picciau on 29 May 2019

https://it.mathworks.com/matlabcentral/answers/464145-running-code-on-gpu-seems-much-slower-than-doing-so-on-cpu#answer_377034

Edited: Andrea Picciau on 29 May 2019

You don't need to disable JIT acceleration. Rather, you need to measure using timeit and gputimeit like so:

% CPU data
a = ones([100, 100], 'single');
h = a;
% GPU data
aa = gpuArray(a);
hh = gpuArray(h);
% Measuring CONV2 with one output
cpuTime = timeit(@() conv2(h, a, 'full'), 1);
gpuTime = gputimeit(@() conv2(hh, aa, 'full'), 1);

Why you might want to do this:

MATLAB uses lazy evaluation to schedule operations on your GPU, which makes the GPU's behaviour asynchronous. The same mechanism is not used on the CPU.
gputimeit takes lazy evaluation into account and also repeats the measurement several times, which accounts for caching effects, overheads, and first-time costs.
timeit also repeats the measurement several times, but it does not take lazy evaluation into account.
tic/toc neither repeats the measurement nor takes lazy evaluation into account.
The profiler is somewhat similar to tic/toc, but it also adds overhead to the measurement because it has to trace the whole call stack (which is why it is useful for investigating rather than for extracting rigorous measurements).
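The difference between these tools can be seen in a minimal sketch like the one below. It assumes Parallel Computing Toolbox and a supported GPU; the array and kernel sizes are arbitrary choices for illustration and are not taken from the thread.

```matlab
% Assumes Parallel Computing Toolbox and a supported GPU.
% Sizes are illustrative only.
aa  = gpuArray(ones(1000, 'single'));   % 1000x1000 image on the GPU
kk  = gpuArray(ones(16, 'single'));     % 16x16 kernel on the GPU
dev = gpuDevice;

% Naive tic/toc: the clock may stop before the GPU has finished
tic;
cc = conv2(aa, kk, 'same');
tNaive = toc;

% tic/toc with explicit synchronization
wait(dev);                  % drain any pending GPU work first
tic;
cc = conv2(aa, kk, 'same');
wait(dev);                  % block until the GPU has finished
tSync = toc;

% gputimeit synchronizes and repeats the measurement for you
tGpu = gputimeit(@() conv2(aa, kk, 'same'));

fprintf('naive %.3g s, synchronized %.3g s, gputimeit %.3g s\n', ...
    tNaive, tSync, tGpu);
```

On a first run, the tic/toc numbers may also include one-off costs such as loading GPU libraries, which is one reason the repeated measurement from gputimeit tends to be more representative.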

What results do you get? Let us know.

Given your setup, it wouldn't be strange if gpuTime > cpuTime. Laptop GPUs are usually not optimized for computing, and yours might be driving the graphics too.
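To see whether there is a problem size at which the GPU starts to win on a given machine, one option is to repeat the measurement over several sizes. This is only a sketch: the sizes are arbitrary, and it assumes Parallel Computing Toolbox and a supported GPU.

```matlab
% Assumes Parallel Computing Toolbox and a supported GPU.
% The sizes below are illustrative, not taken from the thread.
sizes   = [2, 32, 128];
cpuTime = zeros(size(sizes));
gpuTime = zeros(size(sizes));

for k = 1:numel(sizes)
    n  = sizes(k);
    a  = ones(n, n, 'single');   % CPU input
    aa = gpuArray(a);            % same data on the GPU
    cpuTime(k) = timeit(@() conv2(a, a, 'full'));
    gpuTime(k) = gputimeit(@() conv2(aa, aa, 'full'));
end

disp(table(sizes(:), cpuTime(:), gpuTime(:), ...
    'VariableNames', {'n', 'cpuSeconds', 'gpuSeconds'}));
```

For very small inputs the GPU time is typically dominated by fixed launch overhead, so the interesting part of the table is how the two columns scale as n grows.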

5 Comments

Walter Roberson on 30 May 2019

Lazy Evaluation means that instead of computing something immediately, the system keeps track of how you would compute it if you ever turn out to need it. Then, it only computes the item when you ask for the result.

For example, suppose I have a computation such as

result = sum(diag(A*B))

for matrices A and B. The normal non-lazy way of computing this would be to calculate all of the entries of A*B, extract the diagonal, and then sum it. However, you do not need to compute all of the entries to get the final result: you can do

n = size(A, 1);   % assumes A*B is square, so diag() covers every row
d = zeros(1, n);
for i = 1 : n
    d(1,i) = dot(A(i,:), B(:,i).');   % only the diagonal entry C(i,i)
end
result = sum(d);

The lazy way of computing this would not bother to compute C(i,j) = dot(A(i,:), B(:,j).') until C(i,j) is needed for something else, and since only the C(i,i) end up being needed for the diag(), you get away with notably less computation.

Lazy computation for GPU takes a record of the instructions you give, but does not do the computation until the point where the result is needed, such as when you gather() the result back to the CPU.
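As a small check of the sum(diag(A*B)) example, the sketch below compares the full product with a diagonal-only computation. The matrix sizes are arbitrary, and the shortcut is written by hand here; it is not a claim about what MATLAB does internally.

```matlab
rng(0);                 % reproducible illustrative data
A = rand(200, 300);
B = rand(300, 200);

% Eager: forms the full 200x200 product, then keeps only its diagonal
eager = sum(diag(A*B));

% Diagonal only: C(i,i) = sum over k of A(i,k)*B(k,i),
% so multiplying A elementwise by B.' and summing gives the same value
diagOnly = sum(sum(A .* B.'));

relErr = abs(eager - diagOnly) / abs(eager);
fprintf('relative difference: %.2e\n', relErr);
```

The diagonal-only form does roughly 200*300 multiplications instead of 200*200*300, which is the saving the example describes.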

Walter Roberson on 30 May 2019

profile is probably best suited for determining how many times a line was invoked, and for determining which general sections of code are the most time consuming.

However, the actual amount of computation required might be substantially less if the Just In Time compiler had not been turned off by profiling, and that in turn means that in some cases the most expensive section of code is not what it appears to be at first. Once you have identified a general section of code as expensive, then unless it is obvious that one particular line is the most expensive part, it might be better to split the section into smaller functions to narrow down costs, and so restrict your optimization efforts to functions that are disproportionately expensive.

To narrow down which of several different implementations is most efficient for a purpose, create test functions for each variety and time them with timeit(). Repeat the timeit() a number of times as the timings can vary a fair bit. Be careful on interpreting the results: if you have two different routines you are calling timeit() on and the first one seems to be more expensive, then sometimes the real issue is something to do with JIT. This leads me to doing tests such as

N = 50;
timesA1 = zeros(1,N); timesA2 = zeros(1,N); timesB1 = zeros(1,N);
FA = @FirstVariation; FB = @SecondVariation;
for K = 1 : N; timesA1(K) = timeit(FA, 0); end
for K = 1 : N; timesB1(K) = timeit(FB, 0); end
for K = 1 : N; timesA2(K) = timeit(FA, 0); end

It is common for timesA1 to trend much higher than timesB1 and then for timesA2 to show times much closer to timesB1 -- that is, re-testing exactly the same thing can sometimes end up much faster for reasons that are not at all clear.

For timing tests on variations involving GPU use, always use gputimeit().
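A runnable variant of the repeated-timeit idea is sketched below. FA and FB are hypothetical stand-ins for FirstVariation and SecondVariation (two ways of computing a sum of squares), chosen only so that the snippet is self-contained; interleaving the two calls in one loop is one way to reduce order effects.

```matlab
% Hypothetical stand-ins for FirstVariation / SecondVariation
x  = rand(1e5, 1);
FA = @() sum(x.^2);    % variation A: elementwise square, then sum
FB = @() x.' * x;      % variation B: same value via an inner product

N  = 20;
tA = zeros(1, N);
tB = zeros(1, N);
for K = 1:N
    % Interleave the two variations so warm-up effects hit both
    tA(K) = timeit(FA);
    tB(K) = timeit(FB);
end

fprintf('A: median %.3g s (min %.3g, max %.3g)\n', median(tA), min(tA), max(tA));
fprintf('B: median %.3g s (min %.3g, max %.3g)\n', median(tB), min(tB), max(tB));
```

Reporting the median together with the spread makes it easier to tell a real difference between the variations from run-to-run noise.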

Theron FARRELL on 30 May 2019

Very well explained. Thanks a lot!

Answer 2

Walter Roberson on 27 May 2019

https://it.mathworks.com/matlabcentral/answers/464145-running-code-on-gpu-seems-much-slower-than-doing-so-on-cpu#answer_376738

The Quadro K620M is a Maxwell-architecture GPU (GM108 chip). That architecture does double precision at 1/32 the rate of single precision.

MTIMES operations are delegated by MATLAB to optimized BLAS/LAPACK libraries for sufficiently large arrays. These libraries automatically use all available CPU cores.

My CPU shows up as faster for double precision MTIMES and backslash than my GTX 780M does, but the GPU was much faster for single precision, and it is also faster than my CPU for double precision FFT.
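One way to check the single versus double gap on a particular machine is to time the same matrix product in both precisions on both devices. The sketch below assumes Parallel Computing Toolbox and a supported GPU; the matrix size n is an arbitrary choice, and the GFLOPS figures use the usual 2*n^3 operation count as an approximation.

```matlab
% Assumes Parallel Computing Toolbox and a supported GPU; n is illustrative.
n  = 2000;
As = rand(n, 'single');   % single precision on the CPU
Ad = double(As);          % same values in double precision
Gs = gpuArray(As);        % single precision on the GPU
Gd = gpuArray(Ad);        % double precision on the GPU

t.cpuSingle = timeit(@() As*As);
t.cpuDouble = timeit(@() Ad*Ad);
t.gpuSingle = gputimeit(@() Gs*Gs);
t.gpuDouble = gputimeit(@() Gd*Gd);
disp(t);

% Approximate achieved GFLOPS, counting 2*n^3 operations per product
gflops = structfun(@(s) 2*n^3 / s / 1e9, t);
disp(gflops.');
```

If the GPU's double precision throughput is a small fraction of its single precision throughput, the last two entries should differ by a large factor, while the two CPU entries usually differ by much less.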

8 Comments

Theron FARRELL on 27 May 2019

Edited: Theron FARRELL on 27 May 2019

Yes, the rated performance of the K620M is 863.2 GFLOPS for single precision and 26.98 GFLOPS for double precision.

Could it be because I am using the laptop version of the GPU? As suggested in an answer to another post, GPUs in laptops are severely limited.

https://www.mathworks.com/matlabcentral/answers/383903-how-to-run-train-a-convolutional-neural-network-for-regression-example-in-single-precision-on-lapt?s_tid=answers_rc1-2_p2_MLT

I re-ran gpuBench(), and the web report was generated this time.

I also re-ran the simple code with meaningful matrix sizes, as follows:

function Test_GPU()
m = 10^2;
n = 10^2;
% CPU
a = ones(m, n);
h = a;
c = conv2(h, a, 'full');
% GPU single
aa = gpuArray(ones(m, n, 'single'));
hh = aa;
cc = conv2(hh, aa, 'full');
% GPU double
aaa = gpuArray(ones(m, n));
hhh = aaa;
ccc = conv2(hhh, aaa, 'full');
end

And now, it seems that 'single' is the fastest. So strange....

In fact, when I wrote my actual code for work, I deliberately declared every gpuArray as single, and I even used something like

a = gpuArray(zeros(100, 100, 'single'))

instead of

a = single(gpuArray(zeros(100, 100)))

yet the performance of the code still lagged behind its CPU counterpart.
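For reference, the two constructions above can be compared with a third option that allocates directly on the GPU. This is a minimal sketch assuming Parallel Computing Toolbox and a supported GPU; the variable names a1, a2, and a3 are illustrative.

```matlab
% Assumes Parallel Computing Toolbox and a supported GPU.
a1 = gpuArray(zeros(100, 100, 'single'));    % build on the CPU as single, then transfer
a2 = single(gpuArray(zeros(100, 100)));      % transfer as double, then convert on the GPU
a3 = zeros(100, 100, 'single', 'gpuArray');  % allocate directly on the GPU as single

% All three end up as single precision data with identical contents
types = {class(gather(a1)), class(gather(a2)), class(gather(a3))};
disp(types);
disp(isequal(gather(a1), gather(a2), gather(a3)));
```

The three forms differ only in where the allocation and conversion costs are paid, so for arrays this small the choice is unlikely to explain a large slowdown on its own.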

By comparing the GPU and CPU versions of my actual code, I found that the time spent on exp() in the GPU version exceeded that in the CPU version, so I ran another piece of code that exercises exp().

function Test_GPU()
z = ones(1000, 1);
a = exp(-z);
zz = gpuArray(ones(1000, 1, 'single'));
aa = arrayfun(@exp, zz);
zzz = gpuArray(ones(1000, 1));
aaa= exp(-zzz);
end

The fact that the CPU ran the fastest corroborated the observation above, even when I used arrayfun(). But why? How should I modify the code?

What would be your ideas and suggestions, then? Cheers

BTW, I have dedicated the GPU to MATLAB in NVIDIA's control panel and removed all other applications.

Andrea Picciau on 29 May 2019

Edited: Walter Roberson on 29 May 2019

@Jan: Sorry, I meant to say "Theron". I changed my previous comment to fix that.

Jan on 29 May 2019

@Theron: I do not understand why you expect arrayfun to have a positive effect on the processing speed. The opposite is expected.

Starting the profiler disables JIT acceleration automatically, because the JIT can reorder commands when that improves speed, and then the timings no longer correspond to individual lines of code. This means that running the profiler can affect the run time massively, especially for loops. Of course this sounds counter-productive for a profiler, and in fact it is. Therefore the profiler and tic/toc should both be used, because they have different advantages and disadvantages. For measuring the speed of single commands or elementary loops, the profiler is not a good choice.
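To measure the exp() case outside the profiler, a sketch along the following lines could be used. It assumes Parallel Computing Toolbox and a supported GPU; the two sizes are arbitrary, with the smaller one matching the 1000-element vectors in the test above. For the smallest size timeit may warn that the function is too fast to time accurately.

```matlab
% Assumes Parallel Computing Toolbox and a supported GPU; sizes are illustrative.
for n = [1e3, 1e7]
    z  = ones(n, 1, 'single');   % CPU input
    zz = gpuArray(z);            % same data on the GPU

    tCpu      = timeit(@() exp(z));
    tGpu      = gputimeit(@() exp(zz));
    tArrayfun = gputimeit(@() arrayfun(@exp, zz));

    fprintf('n = %.0e: CPU %.3g s, GPU %.3g s, GPU arrayfun %.3g s\n', ...
        n, tCpu, tGpu, tArrayfun);
end
```

With only 1000 elements the GPU figures mostly reflect fixed per-call overhead, so comparing the two sizes gives a better picture of whether the GPU helps for this operation.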


Answer 3

Miguel on 27 Oct 2024 at 15:43

https://it.mathworks.com/matlabcentral/answers/464145-running-code-on-gpu-seems-much-slower-than-doing-so-on-cpu#answer_1537700

I am running a vehicle simulation on the GPU instead of the CPU, and it takes a huge amount of time, even though I have a gaming PC. Why?
