How can I fix the CUDNN_STATUS_EXECUTION_FAILED error while training a Faster R-CNN on a laptop RTX 3070?

Hey everyone,
I am relatively new to deep learning and I've been trying to train a Faster RCNN for multi-class object detection on a custom dataset (3086 training images and 386 validation images of size [224 396 3]). The number of classes is 11.
Recently I've started getting a "CUDA_ERROR_ILLEGAL_ADDRESS" warning, which is followed by the error "Error using nnet.internal.cnngpu.reluForward
Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED".
My GPU device has the following properties (the available memory line disappeared after this error popped up, but it usually shows 7.33 GB of available memory):
Name: 'NVIDIA GeForce RTX 3070 Laptop GPU'
Index: 1
ComputeCapability: '8.6'
SupportsDouble: 1
DriverVersion: 11.3000
ToolkitVersion: 11
MaxThreadsPerBlock: 1024
MaxShmemPerBlock: 49152
MaxThreadBlockSize: [1024 1024 64]
MaxGridSize: [2.1475e+09 65535 65535]
SIMDWidth: 32
TotalMemory: 8.5899e+09
MultiprocessorCount: 40
ClockRateKHz: 1620000
ComputeMode: 'Default'
GPUOverlapsTransfers: 1
KernelExecutionTimeout: 1
CanMapHostMemory: 1
DeviceSupported: 1
DeviceAvailable: 1
DeviceSelected: 1
And these are my training options (backbone network is a ResNet-50):
options = trainingOptions('sgdm', ...
'MiniBatchSize', 1, ...
'InitialLearnRate', 1e-3, ...
'LearnRateSchedule', 'piecewise', ...
'LearnRateDropFactor', 0.2, ...
'LearnRateDropPeriod', 2, ...
'MaxEpochs', 3, ...
'ExecutionEnvironment','gpu',...
'ValidationData', resizedDsVal, ...
'Verbose',true);
% Warm-up call to cuDNN before training; any error it raises is deliberately ignored
try
    nnet.internal.cnngpu.reluForward(1);
catch ME
end
% Train the Faster R-CNN detector
fasterRCNN = trainFasterRCNNObjectDetector(augmentedresizedDsTrain, lgraph, options, ...
'NegativeOverlapRange',[0 0.3], 'PositiveOverlapRange',[0.6 1]);
I should also add that this error appears in the middle of the training process: everything looks fine at the beginning, but then this happens:
Initializing input data normalization.
|==========================================================================================================================================================================================|
| Epoch | Iteration | Time Elapsed | Mini-batch | Mini-batch | Mini-batch | RPN Mini-batch | RPN Mini-batch | Validation | Validation | Validation | Base Learning |
| | | (hh:mm:ss) | Loss | Accuracy | RMSE | Accuracy | RMSE | Loss | Accuracy | RMSE | Rate |
|==========================================================================================================================================================================================|
| 1 | 1 | 00:03:26 | 3.1444 | 30.39% | 0.18 | 57.48% | 0.90 | 3.2009 | 34.06% | 0.16 | 0.0010 |
(...)
| 1 | 1150 | 01:32:11 | 0.0892 | 99.42% | 0.16 | 100.00% | 0.21 | 0.2707 | 99.32% | 0.17 | 0.0010 |
| 1 | 1200 | 01:36:02 | 0.9868 | 98.56% | 0.14 | 94.53% | 1.82 | 0.2551 | 99.44% | 0.17 | 0.0010 |
| 1 | 1250 | 01:39:53 | 0.1590 | 99.80% | 0.13 | 98.44% | 0.37 | 0.2649 | 99.49% | 0.16 | 0.0010 |
| 1 | 1300 | 01:43:18 | 2.0556 | 100.00% | | 98.44% | 2.77 | 0.2512 | 99.19% | 0.15 | 0.0010 |
| 1 | 1350 | 01:46:36 | 0.0542 | 100.00% | 0.08 | 100.00% | 0.23 | 0.2476 | 99.45% | 0.14 | 0.0010 |
Warning: Encountered unexpected error during CUDA execution. The CUDA error was:
CUDA_ERROR_ILLEGAL_ADDRESS
Warning: Encountered unexpected error during CUDA execution. The CUDA error was:
CUDA_ERROR_ILLEGAL_ADDRESS
Warning: Encountered unexpected error during CUDA execution. The CUDA error was:
CUDA_ERROR_ILLEGAL_ADDRESS
Error using nnet.internal.cnngpu.reluForward
Unexpected error calling cuDNN: CUDNN_STATUS_EXECUTION_FAILED.
Error in nnet.internal.cnn.util.VectorReporter/computeAndReport (line 68)
feval( method, this.Reporters{i}, varargin{:} );
Error in nnet.internal.cnn.util.VectorReporter/computeIteration (line 24)
computeAndReport( this, 'computeIteration', summary, network );
Error in nnet.internal.cnn.Trainer/train (line 144)
reporter.computeIteration( this.Summary, net );
Error in vision.internal.cnn.trainNetwork (line 110)
trainedNet = trainer.train(trainedNet, trainingDispatcher);
Error in trainFasterRCNNObjectDetector>iTrainEndToEnd (line 901)
[net, info] = vision.internal.cnn.trainNetwork(...
Error in trainFasterRCNNObjectDetector (line 428)
[detector, info] = iTrainEndToEnd(trainingData, fastRCNN, options, params, executionSettings, imageInfo);
Error in FasterRCNN_ResNet50 (line 106)
fasterRCNN = trainFasterRCNNObjectDetector(augmentedresizedDsTrain, lgraph, options, ...
My MATLAB version is R2021a. When I reboot my PC the issue seems to disappear temporarily, as I am able to restart the training process. However, after 1-2 hours the error pops up again and stops the training.
Any information on how to fix this issue would be valuable, thank you!

Answers (1)

CANBERK TATLI on 24 Jul 2022
Did you solve the problem? I'm encountering the same error. I'm also using an RTX 3070, with MATLAB R2019a.
