trainYOLOv2ObjectDetector

Train YOLO v2 object detector

Description

detector = trainYOLOv2ObjectDetector(trainingData,lgraph,options) returns an object detector trained using the you only look once version 2 (YOLO v2) network architecture specified by the input lgraph. The options input specifies training parameters for the detection network.

detector = trainYOLOv2ObjectDetector(trainingData,checkpoint,options) resumes training from the saved detector checkpoint.

You can use this syntax to:

  • Add more training data and continue the training.

  • Improve training accuracy by increasing the maximum number of iterations.

detector = trainYOLOv2ObjectDetector(trainingData,detector,options) continues training a YOLO v2 object detector. Use this syntax for fine-tuning a detector.

detector = trainYOLOv2ObjectDetector(___,'TrainingImageSize',trainingSizes) specifies the image sizes for multiscale training by using a name-value pair in addition to the input arguments in any of the preceding syntaxes.

[detector,info] = trainYOLOv2ObjectDetector(___) also returns information on the training progress, such as the training loss, training RMSE, and base learning rate for each iteration.

Examples

Load the training data for vehicle detection into the workspace.

data = load('vehicleTrainingData.mat');
trainingData = data.vehicleTrainingData;

Specify the directory in which the training samples are stored, and add the full path to the file names in the training data.

dataDir = fullfile(toolboxdir('vision'),'visiondata');
trainingData.imageFilename = fullfile(dataDir,trainingData.imageFilename);

Randomly shuffle data for training.

rng(0);
shuffledIdx = randperm(height(trainingData));
trainingData = trainingData(shuffledIdx,:);

Create an imageDatastore using the files from the table.

imds = imageDatastore(trainingData.imageFilename);

Create a boxLabelDatastore using the label columns from the table.

blds = boxLabelDatastore(trainingData(:,2:end));

Combine the datastores.

ds = combine(imds, blds);

Load a preinitialized YOLO v2 object detection network.

net = load('yolov2VehicleDetector.mat');
lgraph = net.lgraph
lgraph = 
  LayerGraph with properties:

         Layers: [25×1 nnet.cnn.layer.Layer]
    Connections: [24×2 table]

Inspect the layers in the YOLO v2 network and their properties. You can also create the YOLO v2 network by following the steps given in Create YOLO v2 Object Detection Network.

lgraph.Layers
ans = 
  25x1 Layer array with layers:

     1   'input'               Image Input               128x128x3 images
     2   'conv_1'              Convolution               16 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
     3   'BN1'                 Batch Normalization       Batch normalization
     4   'relu_1'              ReLU                      ReLU
     5   'maxpool1'            Max Pooling               2x2 max pooling with stride [2  2] and padding [0  0  0  0]
     6   'conv_2'              Convolution               32 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
     7   'BN2'                 Batch Normalization       Batch normalization
     8   'relu_2'              ReLU                      ReLU
     9   'maxpool2'            Max Pooling               2x2 max pooling with stride [2  2] and padding [0  0  0  0]
    10   'conv_3'              Convolution               64 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
    11   'BN3'                 Batch Normalization       Batch normalization
    12   'relu_3'              ReLU                      ReLU
    13   'maxpool3'            Max Pooling               2x2 max pooling with stride [2  2] and padding [0  0  0  0]
    14   'conv_4'              Convolution               128 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
    15   'BN4'                 Batch Normalization       Batch normalization
    16   'relu_4'              ReLU                      ReLU
    17   'yolov2Conv1'         Convolution               128 3x3 convolutions with stride [1  1] and padding 'same'
    18   'yolov2Batch1'        Batch Normalization       Batch normalization
    19   'yolov2Relu1'         ReLU                      ReLU
    20   'yolov2Conv2'         Convolution               128 3x3 convolutions with stride [1  1] and padding 'same'
    21   'yolov2Batch2'        Batch Normalization       Batch normalization
    22   'yolov2Relu2'         ReLU                      ReLU
    23   'yolov2ClassConv'     Convolution               24 1x1 convolutions with stride [1  1] and padding [0  0  0  0]
    24   'yolov2Transform'     YOLO v2 Transform Layer   YOLO v2 Transform Layer with 4 anchors
    25   'yolov2OutputLayer'   YOLO v2 Output            YOLO v2 Output with 4 anchors

Configure the network training options.

options = trainingOptions('sgdm',...
          'InitialLearnRate',0.001,...
          'Verbose',true,...
          'MiniBatchSize',16,...
          'MaxEpochs',30,...
          'Shuffle','never',...
          'VerboseFrequency',30,...
          'CheckpointPath',tempdir);

Train the YOLO v2 network.

[detector,info] = trainYOLOv2ObjectDetector(ds,lgraph,options);
*************************************************************************
Training a YOLO v2 Object Detector for the following object classes:

* vehicle

Checking training data...done.
Initializing input data normalization.
Training on single GPU.
|========================================================================================|
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Mini-batch  |  Base Learning  |
|         |             |   (hh:mm:ss)   |     RMSE     |     Loss     |      Rate       |
|========================================================================================|
|       1 |           1 |       00:00:00 |         7.14 |         50.9 |          0.0010 |
|       2 |          30 |       00:00:15 |         1.41 |          2.0 |          0.0010 |
|       4 |          60 |       00:00:29 |         1.33 |          1.8 |          0.0010 |
|       5 |          90 |       00:00:43 |         0.85 |          0.7 |          0.0010 |
|       7 |         120 |       00:00:58 |         0.92 |          0.8 |          0.0010 |
|       9 |         150 |       00:01:12 |         1.05 |          1.1 |          0.0010 |
|      10 |         180 |       00:01:26 |         0.65 |          0.4 |          0.0010 |
|      12 |         210 |       00:01:40 |         0.74 |          0.5 |          0.0010 |
|      14 |         240 |       00:01:55 |         0.72 |          0.5 |          0.0010 |
|      15 |         270 |       00:02:08 |         0.60 |          0.4 |          0.0010 |
|      17 |         300 |       00:02:23 |         0.55 |          0.3 |          0.0010 |
|      19 |         330 |       00:02:37 |         0.53 |          0.3 |          0.0010 |
|      20 |         360 |       00:02:50 |         0.52 |          0.3 |          0.0010 |
|      22 |         390 |       00:03:04 |         0.54 |          0.3 |          0.0010 |
|      24 |         420 |       00:03:18 |         0.50 |          0.3 |          0.0010 |
|      25 |         450 |       00:03:32 |         0.48 |          0.2 |          0.0010 |
|      27 |         480 |       00:03:46 |         0.64 |          0.4 |          0.0010 |
|      29 |         510 |       00:04:00 |         0.44 |          0.2 |          0.0010 |
|      30 |         540 |       00:04:13 |         0.43 |          0.2 |          0.0010 |
|========================================================================================|
Detector training complete.
*************************************************************************

Inspect the properties of the detector.

detector
detector = 
  yolov2ObjectDetector with properties:

            ModelName: 'vehicle'
              Network: [1×1 DAGNetwork]
           ClassNames: {'vehicle'}
          AnchorBoxes: [4×2 double]
    TrainingImageSize: [128 128]

You can verify the training accuracy by inspecting the training loss for each iteration.

figure
plot(info.TrainingLoss)
grid on
xlabel('Number of Iterations')
ylabel('Training Loss for Each Iteration')

Read a test image into the workspace.

img = imread('detectcars.png');

Run the trained YOLO v2 object detector on the test image for vehicle detection.

[bboxes,scores] = detect(detector,img);

Display the detection results.

if(~isempty(bboxes))
    img = insertObjectAnnotation(img,'rectangle',bboxes,scores);
end
figure
imshow(img)

Input Arguments

Labeled ground truth images, specified as a datastore or a table.

  • If you use a datastore, calling the datastore with the read and readall functions must return a cell array or table with three columns, {images,boxes,labels}.

    • images — The first column must be a cell vector of images that can be grayscale, RGB, or M-by-N-by-P multichannel images.

    • boxes — The second column must be a cell vector that contains M-by-4 matrices of bounding boxes in the format [x,y,width,height]. The vectors represent the location and size of bounding boxes for the objects in each image.

    • labels — The third column must be a cell vector that contains M-by-1 categorical vectors containing object class names. All categorical data returned by the datastore must contain the same categories.

    You can use the combine function to create the datastore to use for training, as shown in the sketch after this argument description.

    • imageDatastore — Create a datastore containing images.

    • boxLabelDatastore — Create a datastore containing bounding boxes and labels.

    • combine(imds,blds) — Combine images, bounding boxes, and labels into one datastore.

    For more information, see Datastores for Deep Learning (Deep Learning Toolbox).

  • If you use a table, the table must have two or more columns. The first column of the table must contain image file names with paths. The images must be grayscale or truecolor (RGB), and they can be in any format supported by imread. Each of the remaining columns corresponds to a single object class, such as vehicle, flower, or stop sign, and must be a cell vector that contains M-by-4 double matrices of M bounding boxes in the format [x,y,width,height]. The format specifies the upper-left corner location and the size of the bounding box in the corresponding image. To create a ground truth table, you can use the Image Labeler app or Video Labeler app. To create a table of training data from the generated ground truth, use the objectDetectorTrainingData function.

Note

When the training data is specified using a table, the trainYOLOv2ObjectDetector function checks these conditions:

  • The bounding box values must be integers. Otherwise, the function automatically rounds each noninteger value to its nearest integer.

  • The bounding box must not be empty and must be within the image region. While training the network, the function ignores empty bounding boxes and bounding boxes that lie partially or fully outside the image region.
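
For example, the following is a minimal sketch of building a combined datastore from a two-column training table and checking the format returned by a single read. It reuses the vehicle training data from the example above; the commented objectDetectorTrainingData call assumes a hypothetical groundTruth object named gTruth exported from the Image Labeler app.

data = load('vehicleTrainingData.mat');
trainingData = data.vehicleTrainingData;
trainingData.imageFilename = fullfile(toolboxdir('vision'),'visiondata', ...
    trainingData.imageFilename);

% Alternatively, create the training table from labeled ground truth:
% trainingData = objectDetectorTrainingData(gTruth);

imds = imageDatastore(trainingData.imageFilename);   % images
blds = boxLabelDatastore(trainingData(:,2:end));     % boxes and labels
ds = combine(imds,blds);

% Each read returns a 1-by-3 cell array: {image, M-by-4 boxes, M-by-1 categorical labels}.
firstSample = preview(ds)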

Layer graph, specified as a LayerGraph object. The layer graph contains the architecture of the YOLO v2 network. You can create this network by using the yolov2Layers function. Alternatively, you can create the network layers by using the yolov2TransformLayer, yolov2ReorgLayer, and yolov2OutputLayer functions. For more details on creating a custom YOLO v2 network, see Design a YOLO v2 Detection Network.

Training options, specified as a TrainingOptionsSGDM, TrainingOptionsRMSProp, or TrainingOptionsADAM object returned by the trainingOptions function. To specify the solver name and other options for network training, use the trainingOptions function.

Note

The trainYOLOv2ObjectDetector function does not support these training options:

  • The 'training-progress' value of the Plots training option

  • The ValidationData, ValidationFrequency, or ValidationPatience training options

  • The OutputFcn training option

  • The 'once' and 'every-epoch' values of the Shuffle training option when you use a datastore input

Saved detector checkpoint, specified as a yolov2ObjectDetector object. To save the detector after every epoch, set the 'CheckpointPath' name-value argument when using the trainingOptions function. Saving a checkpoint after every epoch is recommended because network training can take a few hours.

To load a checkpoint for a previously trained detector, load the MAT-file from the checkpoint path. For example, if the CheckpointPath property of the object specified by options is '/checkpath', you can load a checkpoint MAT-file by using this code.

data = load('/checkpath/yolov2_checkpoint__216__2018_11_16__13_34_30.mat');
checkpoint = data.detector;

The name of the MAT-file includes the iteration number and timestamp of when the detector checkpoint was saved. The detector is saved in the detector variable of the file. Pass this file back into the trainYOLOv2ObjectDetector function:

yoloDetector = trainYOLOv2ObjectDetector(trainingData,checkpoint,options);

Previously trained YOLO v2 object detector, specified as a yolov2ObjectDetector object. Use this syntax to continue training a detector with additional training data or to perform more training iterations to improve detector accuracy.

Set of image sizes for multiscale training, specified as an M-by-2 matrix, where each row is of the form [height width]. For each training epoch, the input training images are randomly resized to one of the M image sizes specified in this set.

If you do not specify trainingSizes, the function sets this value to the size in the image input layer of the YOLO v2 network, and the network resizes all training images to that size.

Note

The input trainingSizes values specified for multiscale training must be greater than or equal to the input size in the image input layer of the lgraph input argument.
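
For example, the following is a minimal sketch of multiscale training that reuses the ds, lgraph, and options variables from the example above. The three sizes are illustrative; each is at least the 128-by-128 input size of the example network.

trainingSizes = [128 128; 160 160; 192 192];
[detector,info] = trainYOLOv2ObjectDetector(ds,lgraph,options, ...
    'TrainingImageSize',trainingSizes);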

Output Arguments

Trained YOLO v2 object detector, returned as a yolov2ObjectDetector object. You can train a YOLO v2 object detector to detect multiple object classes.

Training progress information, returned as a structure with these fields:

  • TrainingLoss — Training loss at each iteration is the mean squared error (MSE) calculated as the sum of localization error, confidence loss, and classification loss. For more information about the training loss function, see Training Loss.

  • TrainingRMSE — Training root mean squared error (RMSE) is the RMSE calculated from the training loss at each iteration.

  • BaseLearnRate — Learning rate at each iteration.

Each field is a numeric vector with one element per training iteration. Values that have not been calculated at a specific iteration are assigned as NaN.
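
For example, assuming the info output returned by the training call in the example above, you can inspect the values recorded at the final iteration:

numIterations = numel(info.TrainingLoss)   % one element per iteration
finalLoss = info.TrainingLoss(end)
finalRMSE = info.TrainingRMSE(end)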

More About

Data Preprocessing

By default, the trainYOLOv2ObjectDetector function preprocesses the training images by:

  • Resizing the input images to match the input size of the network.

  • Normalizing the pixel values of the input images to lie in the range [0, 1].

When you specify the training data by using a table, the trainYOLOv2ObjectDetector function performs data augmentation for preprocessing. The function augments the input dataset by:

  • Reflecting the training data horizontally. The probability for horizontally flipping each image in the training data is 0.5.

  • Uniformly scaling (zooming) the training data by a scale factor that is randomly picked from a continuous uniform distribution in the range [1, 1.1].

  • Randomly jittering the color of the training data by adjusting the brightness, hue, saturation, and contrast.

When you specify the training data by using a datastore, the trainYOLOv2ObjectDetector function does not perform data augmentation. Instead, you can augment the training data in the datastore by using the transform function, and then train the network with the augmented training data. For more information on how to apply augmentation while using datastores, see Apply Augmentation to Training Data in Datastores (Deep Learning Toolbox).
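
The following is a minimal sketch of datastore augmentation, not a recommended recipe. It assumes the Image Processing Toolbox functions jitterColorHSV, randomAffine2d, affineOutputView, imwarp, and bboxwarp are available in your release; the helper name augmentTrainingData and the parameter values are illustrative.

augmentedDs = transform(ds,@augmentTrainingData);
[detector,info] = trainYOLOv2ObjectDetector(augmentedDs,lgraph,options);

function data = augmentTrainingData(data)
% data is the 1-by-3 cell array returned by one read of the combined
% datastore: {image, M-by-4 boxes, M-by-1 labels}.
I = data{1};
boxes = data{2};

% Randomly jitter brightness, contrast, saturation, and hue (illustrative values).
I = jitterColorHSV(I,'Brightness',0.2,'Contrast',0.2,'Saturation',0.2,'Hue',0.05);

% Apply a random horizontal reflection and scale jitter to the image and the
% boxes together, discarding boxes that fall mostly outside the output view.
tform = randomAffine2d('XReflection',true,'Scale',[1 1.1]);
rout = affineOutputView(size(I),tform,'BoundsStyle','CenterOutput');
I = imwarp(I,tform,'OutputView',rout);
[boxes,valid] = bboxwarp(boxes,tform,rout,'OverlapThreshold',0.25);

data{1} = I;
data{2} = boxes;
data{3} = data{3}(valid,:);
end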

Training Loss

During training, the YOLO v2 object detection network optimizes the MSE loss between the predicted bounding boxes and the ground truth. The loss function is defined as

$$
\begin{aligned}
\mathrm{loss} = {} & K_1\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{\mathrm{obj}}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
{}+{} & K_1\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{\mathrm{obj}}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
{}+{} & K_2\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{\mathrm{obj}}\left(C_i-\hat{C}_i\right)^2 + K_3\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{\mathrm{noobj}}\left(C_i-\hat{C}_i\right)^2 \\
{}+{} & K_4\sum_{i=0}^{S^2}\mathbb{1}_{i}^{\mathrm{obj}}\sum_{c\in\mathrm{classes}}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$

where:

  • S is the number of grid cells.

  • B is the number of bounding boxes in each grid cell.

  • 1_ij^obj is 1 if the jth bounding box in grid cell i is responsible for detecting the object. Otherwise it is set to 0. A grid cell i is responsible for detecting the object if the overlap between the ground truth and a bounding box in that grid cell is greater than or equal to 0.6.

  • 1_ij^noobj is 1 if the jth bounding box in grid cell i does not contain any object. Otherwise it is set to 0.

  • 1_i^obj is 1 if an object is detected in grid cell i. Otherwise it is set to 0.

  • K1, K2, K3, and K4 are the weights. To adjust the weights, modify the LossFactors property of the output layer by using the yolov2OutputLayer function (see the sketch at the end of this section).

The loss function can be split into three parts:

  • Localization loss

    The first and second terms in the loss function comprise the localization loss. It measures the error between the predicted bounding box and the ground truth. The parameters for computing the localization loss include the position and size of the predicted bounding box and of the ground truth. The parameters are defined as follows.

    • (xi, yi) is the center of the jth bounding box relative to grid cell i.

    • (x̂i, ŷi) is the center of the ground truth relative to grid cell i.

    • wi and hi are the width and the height of the jth bounding box in grid cell i, respectively. The size of the predicted bounding box is specified relative to the input image size.

    • ŵi and ĥi are the width and the height of the ground truth in grid cell i, respectively.

    • K1 is the weight for localization loss. Increase this value to increase the weight given to bounding box prediction errors.

  • Confidence loss

    The third and fourth terms in the loss function comprise the confidence loss. The third term measures the objectness (confidence score) error when an object is detected in the jth bounding box of grid cell i. The fourth term measures the objectness error when no object is detected in the jth bounding box of grid cell i. The parameters for computing the confidence loss are defined as follows.

    • Ci is the confidence score of the jth bounding box in grid cell i.

    • Ĉi is the confidence score of the ground truth in grid cell i.

    • K2 is the weight for objectness error, when an object is detected in the predicted bounding box. You can adjust the value of K2 to weigh confidence scores from grid cells that contain objects.

    • K3 is the weight for objectness error, when an object is not detected in the predicted bounding box. You can adjust the value of K3 to weigh confidence scores from grid cells that do not contain objects.

    The confidence loss can cause the training to diverge when the number of grid cells that do not contain objects is more than the number of grid cells that contain objects. To remedy this, increase the value for K2 and decrease the value for K3.

  • Classification loss

    The fifth term in the loss function comprises the classification loss. For example, suppose that an object is detected in the predicted bounding box contained in grid cell i. Then, the classification loss measures the squared error between the estimated and actual conditional class probabilities for each class in grid cell i. The parameters for computing the classification loss are defined as follows.

    • pi(c) is the estimated conditional class probability for object class c in grid cell i.

    • p̂i(c) is the actual conditional class probability for object class c in grid cell i.

    • K4 is the weight for classification error when an object is detected in the grid cell. Increase this value to increase the weight given to the classification loss.
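
As noted in the definition of K1 through K4, these weights correspond to the LossFactors property of the YOLO v2 output layer. The following is a minimal sketch of setting them when constructing the output layer with yolov2OutputLayer; the anchor boxes and the [K1 K2 K3 K4] values shown are illustrative, not recommended settings.

anchorBoxes = [43 59; 18 22; 23 29; 84 109];   % example anchor boxes in [height width] format
outputLayer = yolov2OutputLayer(anchorBoxes, ...
    'Name','yolov2OutputLayer', ...
    'LossFactors',[5 1 1 1]);                  % [K1 K2 K3 K4]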

Tips

  • To generate the ground truth, use the Image Labeler or Video Labeler app. To create a table of training data from the generated ground truth, use the objectDetectorTrainingData function.

  • To improve prediction accuracy,

    • Increase the number of images used to train the network. You can expand the training dataset through data augmentation. For information on how to apply data augmentation for preprocessing, see Preprocess Images for Deep Learning (Deep Learning Toolbox).

    • Perform multiscale training by using the trainYOLOv2ObjectDetector function. To do so, specify the 'TrainingImageSize' argument of the trainYOLOv2ObjectDetector function when training the network.

    • Choose anchor boxes appropriate to the dataset for training the network. You can use the estimateAnchorBoxes function to compute anchor boxes directly from the training data.
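
For example, the following is a minimal sketch that estimates four anchor boxes from the box label datastore blds created in the example above; it assumes the estimateAnchorBoxes function is available in your release, and the commented yolov2Layers call uses hypothetical placeholders featureExtractionNetwork and featureLayer.

numAnchors = 4;
[anchorBoxes,meanIoU] = estimateAnchorBoxes(blds,numAnchors);

% Use the estimated anchor boxes when constructing the YOLO v2 network,
% for example (1 is the number of object classes, 'vehicle'):
% lgraph = yolov2Layers([128 128 3],1,anchorBoxes,featureExtractionNetwork,featureLayer);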

References

[1] Redmon, Joseph, Santosh Divvala, Ross Girshick, and Ali Farhadi. "You Only Look Once: Unified, Real-Time Object Detection." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788. Las Vegas, NV: CVPR, 2016.

[2] Redmon, Joseph, and Ali Farhadi. "YOLO9000: Better, Faster, Stronger." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517–6525. Honolulu, HI: CVPR, 2017.

Introduced in R2019a