GPU training of neural network with parallel computing toolbox unreasonably slow, what am I missing?

Question

2 votes

I'm trying to speed up the training of some NARNET neural networks by using the GPU support in the Parallel Computing Toolbox, but so far I haven't gotten it to work. Or rather, it works, but it is unreasonably slow. According to the documentation, training on a GPU instead of the CPU shouldn't be any harder than adding 'useGPU','yes' to the training command. However, if I create some dummy data, for example a sine wave with about 900 values, and train a NARNET on it using the CPU like so:

%CPU training
T = num2cell(sin(1:0.01:10));
net = narnet( 1:2, 10 ); 
[ Xs, Xsi, Asi, Ts] = preparets( net, {}, {}, T );
rng(0)
net.trainFcn = 'trainscg';
tic
net = train(net,Xs,Ts,'showResources','yes' );
toc %2.77

The training takes less than 3 seconds. But when I do the exact same thing on a CUDA-capable GTX 760 GPU:

%GPU training
T = num2cell(sin(1:0.01:10));
net = narnet( 1:2, 10 ); 
[ Xs, Xsi, Asi, Ts] = preparets( net, {}, {}, T );
rng(0)
net.trainFcn = 'trainscg';
tic
net = train(net,Xs,Ts,'useGPU','yes','showResources','yes' );
toc % 1247.6

Incredibly, the training takes over 20 minutes!

I've read through MathWorks' fairly extensive documentation on parallel and GPU computing with the Neural Network Toolbox (link here) and seen that there are a few things that can or should be done when computing on a GPU, for example converting the input and target data to GPU arrays before training with the nndata2gpu command, and replacing any tansig activation functions with elliotsig. This does speed up the training a bit:

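% Improved GPU training
T = num2cell(sin(1:0.01:10));
net = narnet( 1:2, 10 );
[ Xs, Xsi, Asi, Ts ] = preparets( net, {}, {}, T );
rng(0)
% Configure the network before converting the data to GPU format
net = configure(net,Xs,Ts);
% Convert inputs, targets and initial input states to GPU arrays
Xs = nndata2gpu(Xs);
Ts = nndata2gpu(Ts);
Xsi = nndata2gpu(Xsi); % converted, but not passed to train below
% Replace tansig with elliotsig in every layer that uses it
for i=1:net.numLayers
  if strcmp(net.layers{i}.transferFcn,'tansig')
    net.layers{i}.transferFcn = 'elliotsig';
  end
end
net.trainFcn = 'trainscg';
tic
% No 'useGPU' flag here: the data is already in GPU format
net = train(net,Xs,Ts,'showResources','yes' );
toc  %70.79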

The training here takes only about 70 seconds, but that is still many times slower than just doing it on my CPU. I've tried several different data series sizes and network architectures, but I've never seen GPU training compete with the CPU, which is strange since, as I understand it, most professional ANN research is done on GPUs.

What am I doing wrong here? Clearly I must be missing something fundamental.

Thanks

1 Comment

Greg Heath on 10 Jul 2015

Edited: Greg Heath on 10 Jul 2015

You don't need an if statement to replace tansig by elliotsig. Just replace it right after you define the net.

My elliot4sig is a little faster

http://www.mathworks.com/matlabcentral/answers/56137-how-to-use-a-custom-transfer-function-in-neural-net-training

elliot4sig(x) = x/(0.25 + abs(x))

Greg
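To make the difference between the two formulas concrete, here is a minimal sketch that evaluates them as plain anonymous functions. It assumes the standard elliotsig formula n/(1 + |n|) and uses Greg's formula exactly as written above; the names elliot and elliot4 are made up for this illustration and are not toolbox transfer functions.

```matlab
% Plain anonymous functions for illustration only (hypothetical names,
% not the toolbox transfer functions themselves).
elliot  = @(x) x ./ (1 + abs(x));     % same formula as elliotsig
elliot4 = @(x) x ./ (0.25 + abs(x));  % elliot4sig as given in the comment above

x  = linspace(-5, 5, 11);  % a few sample points, including 0
y1 = elliot(x);
y4 = elliot4(x);

% Both map into (-1, 1); the smaller constant makes elliot4 saturate sooner.
disp([x; y1; y4]);
```

Neither formula needs an exponential, which is the usual reason given for preferring them to tansig on a GPU. Whether elliot4sig is actually faster in training is Greg's observation, not something this sketch measures.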

Answer 1 (accepted answer)

Mark Hudson Beale on 15 Jul 2015

0 votes

Getting a speedup with a GPU requires a couple of things:

1) The amount of time spent in gradient calculations (which happen on CPU or GPU as you request) is significant compared to the training step update (which still happens on the CPU).

2) The problem allows enough parallelism to run efficiently on the much slower but much greater number of GPU cores relative to the CPU.

Both requirements are easier to meet as the dataset and the network grow: more parallelism can be exploited, and a greater share of the computation is spent in the gradient, so the training step updates are no longer the bottleneck.

The NAR problem you defined only has 899 steps with a 10 neuron network. The fact that both dataset and network are very small is why you are not seeing a speedup. Problems that take only a few seconds on CPU probably are not going to see much of a speedup with GPU.

You are correct that, at this time, NAR networks train faster with NNDATA2GPU-formatted data than with gpuArray data.
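One way to check how the two requirements above play out on a given machine is to time the same training on CPU and GPU for increasing series lengths. The following is only a sketch under assumptions: the sizes, the 50-epoch cap and the hidden training window are illustrative choices, not settings from this thread, and early stopping may end the two runs at different epochs, so treat the timings as rough.

```matlab
% Hypothetical sweep: series lengths and the epoch cap are illustrative.
sizes = [1e3 1e4 5e4];
for n = sizes
    T = num2cell(sin(linspace(1, 10, n)));
    net = narnet(1:2, 10);
    net.trainFcn = 'trainscg';
    net.trainParam.epochs = 50;         % cap epochs so the sweep stays short
    net.trainParam.showWindow = false;  % suppress the training GUI
    [Xs, ~, ~, Ts] = preparets(net, {}, {}, T);

    % CPU timing
    rng(0); tic; train(net, Xs, Ts); tCPU = toc;

    % GPU timing, only when a supported device is present
    if gpuDeviceCount > 0
        rng(0); tic; train(net, Xs, Ts, 'useGPU', 'yes'); tGPU = toc;
        fprintf('n = %6d: CPU %.2f s, GPU %.2f s\n', n, tCPU, tGPU);
    else
        fprintf('n = %6d: CPU %.2f s (no GPU found)\n', n, tCPU);
    end
end
```

If the GPU column never drops below the CPU column as n grows, the per-step work is probably still too small to offset the fixed GPU overhead for this network.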

1 Comment

PetterS on 15 Jul 2015

I just tried with 50 000 values and 25 hidden nodes. If I go much higher than that I get an error saying my GPU is out of memory (and for practical purposes I don't think I will ever need more than 25 nodes). I used nndata2gpu(Ts,'single') to convert the data to singles. STILL, the GPU training is 10% slower than the CPU.

So if there is nothing else I can do to speed it up I guess GPU training my NARNETS is a lost cause…

Thanks anyway

Answer 2

Adam Hug on 2 Jul 2015

0 votes

I suspect the problem size of 900 values may be too small for you to benefit from the GPU architecture, especially since 900 values easily fit into a CPU cache. The problem size needs to be much larger for the communication between the CPU and GPU to be small in comparison to the computation. Try a sine wave with one million values and see if the GPU outperforms the CPU.
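The communication-versus-computation point can be illustrated outside the neural network code. The sketch below, which assumes a supported GPU and the Parallel Computing Toolbox, times a host-to-device round trip and an elementwise sin on arrays of 900 and one million values. The choice of sin as the workload is an arbitrary stand-in, not what train actually computes.

```matlab
% Rough illustration of transfer cost versus compute cost.
% The elementwise sin is an arbitrary stand-in workload.
if gpuDeviceCount > 0
    for n = [900 1e6]
        x  = rand(1, n);
        xg = gpuArray(x);

        tXfer = timeit(@() gather(gpuArray(x)));  % copy to GPU and back
        tGpu  = gputimeit(@() sin(xg));           % compute on the GPU
        tCpu  = timeit(@() sin(x));               % compute on the CPU

        fprintf('n = %7d: transfer %.2e s, GPU %.2e s, CPU %.2e s\n', ...
                n, tXfer, tGpu, tCpu);
    end
else
    disp('No supported GPU found; nothing to time.');
end
```

For small n, the transfer and launch overhead usually dominates, which is the effect described above.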

4 Comments

PetterS on 2 Jul 2015

Edited: PetterS on 3 Jul 2015

If I try doing a sine wave with one million values I get an error saying my GPU device is out of memory so I can’t do that. A hundred thousand values will work however and then GPU training takes 7076 seconds. Training a hundred thousand values on the CPU takes 6305 seconds.

So it's possible that the performance gap has decreased somewhat with the larger problem size, but 100 000 values should still be more than I can fit in my 8 MB CPU cache, and still the GPU didn't offer any performance increase at all. I also can't increase the problem size much more before running out of GPU memory.

Another suspicious thing I noticed, even though it's not directly related to the original GPU question, is that if I use parallel CPU training by passing 'useParallel','yes' (instead of 'useGPU') to the training command, MATLAB reports the following resources:

Computing Resources:
Parallel Workers:
  Worker 1 on Petter- Computer, MEX on PCWIN64
  Worker 2 on Petter- Computer, Unused
  Worker 3 on Petter- Computer, Unused
  Worker 4 on Petter- Computer, Unused

Shouldn't "MEX on PCWIN64" be listed for all four of my workers? Training like this takes 7740 seconds, which is even longer than without the parallel option. Maybe that is an indication that MATLAB is having trouble running the training algorithm in parallel, and that is also why the GPU performance isn't good?

Update: when I use the default Levenberg-Marquardt training algorithm, all workers are utilized when calling with 'useParallel','yes', so maybe the problem is in the Scaled Conjugate Gradient training.

PetterS on 13 Jul 2015

Edited: PetterS on 13 Jul 2015

As far as I understand, it's not possible to feed single-precision numbers directly to a neural network in MATLAB; the inputs have to be presented as a cell array. I don't know enough about cell arrays to be certain how they work: is the cell just a container that can hold singles, or does the cell count as its own data type?

I've tried converting Xs and Ts from cells to doubles, then to singles, and then back to cell arrays like this:

Xs = num2cell(single(cell2mat(Xs)));
Ts = num2cell(single(cell2mat(Ts)));

Training with the input data like this improved the GPU performance by 10-15%, so it got a bit better but is still nothing too impressive. Am I doing the single conversion correctly? According to Wikipedia's single/double performance ratings, my GPU should be much more than 10-15% faster when working with singles.

Amanjit Dulai on 14 Jul 2015

You should be able to convert the data to single precision with nndata2gpu as follows:

Xs = nndata2gpu(Xs,'single');
