Semantic Segmentation With Deep Learning
Analyze Training Data for Semantic Segmentation
To train a semantic segmentation network, you need a collection of images and a corresponding collection of pixel labeled images. A pixel labeled image is an image where every pixel value represents the categorical label of that pixel.
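For intuition, here is a minimal sketch of that mapping, using made-up label IDs and class names (the pixelLabelDatastore used below performs this conversion for you):
% A hypothetical 3-by-3 pixel label image: each element is a label ID.
tinyLabels = [1 1 2; 1 2 2; 3 3 3];
% Map label IDs 1:3 to class names, producing a categorical matrix.
tinyC = categorical(tinyLabels,1:3,["sky" "grass" "building"])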
The following code loads a small set of images and their corresponding pixel labeled images:
dataDir = fullfile(toolboxdir('vision'),'visiondata');
imDir = fullfile(dataDir,'building');
pxDir = fullfile(dataDir,'buildingPixelLabels');
Load the image data using an imageDatastore. An image datastore can efficiently represent a large collection of images because images are only read into memory when needed.
imds = imageDatastore(imDir);
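No pixel data has been read yet; the datastore simply records the file paths it found:
% Inspect how many image files the datastore discovered.
numImages = numel(imds.Files)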
Read and display the first image.
I = readimage(imds,1);
figure
imshow(I)
Load the pixel label images using a pixelLabelDatastore to define the mapping between label IDs and categorical names. In the dataset used here, the labels are "sky", "grass", "building", and "sidewalk". The label IDs for these classes are 1, 2, 3, 4, respectively.
Define the class names.
classNames = ["sky" "grass" "building" "sidewalk"];
Define the label ID for each class name.
pixelLabelID = [1 2 3 4];
Create a pixelLabelDatastore.
pxds = pixelLabelDatastore(pxDir,classNames,pixelLabelID);
Read the first pixel label image.
C = readimage(pxds,1);
The output C is a categorical matrix where C(i,j) is the categorical label of pixel I(i,j).
C(5,5)
ans = categorical
sky
Overlay the pixel labels on the image to see how different parts of the image are labeled.
B = labeloverlay(I,C);
figure
imshow(B)
The categorical output format simplifies tasks that operate on class names. For instance, you can create a binary mask of just the building pixels:
buildingMask = C == 'building';
figure
imshowpair(I,buildingMask,'montage')
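Because the mask is logical, simple measurements follow directly; for example, the fraction of the image covered by the building class:
% Fraction of pixels labeled "building" in this image.
buildingFraction = nnz(buildingMask)/numel(buildingMask)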
Create a Semantic Segmentation Network
Create a simple semantic segmentation network and learn about the common layers found in many semantic segmentation networks. A common pattern in semantic segmentation networks is to downsample an image through a series of convolution and ReLU layers, and then upsample the output to match the input size. This operation is analogous to standard scale-space analysis using image pyramids. During this process, however, the network performs the operations using nonlinear filters optimized for the specific set of classes you want to segment.
Create An Image Input Layer
A semantic segmentation network starts with an imageInputLayer, which defines the smallest image size the network can process. Most semantic segmentation networks are fully convolutional, which means they can process images that are larger than the specified input size. Here, an image size of [32 32 3] is used, so the network can still process larger RGB images, such as 64-by-64.
inputSize = [32 32 3];
imgLayer = imageInputLayer(inputSize)
imgLayer = 
  ImageInputLayer with properties:

                Name: ''
           InputSize: [32 32 3]

   Hyperparameters
    DataAugmentation: 'none'
       Normalization: 'zerocenter'
Create Downsampling Network
Start with the convolution and ReLU layers. The convolution layer padding is selected such that the output size of the convolution layer is the same as the input size. This makes it easier to construct a network because the input and output sizes between most layers remain the same as you progress through the network.
filterSize = 3;
numFilters = 32;
conv = convolution2dLayer(filterSize,numFilters,'Padding',1);
relu = reluLayer();
The downsampling is performed using a max pooling layer. Create a max pooling layer to downsample the input by a factor of 2 by setting the 'Stride' parameter to 2.
poolSize = 2;
maxPoolDownsample2x = maxPooling2dLayer(poolSize,'Stride',2);
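As a quick sanity check on the arithmetic, the standard pooling output-size formula, floor((inputSize - poolSize)/stride) + 1, confirms that this layer halves each spatial dimension:
% For a 32-pixel spatial dimension, pooling with stride 2 yields 16.
dim = 32;
pooledDim = floor((dim - poolSize)/2) + 1   % returns 16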
Stack the convolution, ReLU, and max pooling layers to create a network that downsamples its input by a factor of 4.
downsamplingLayers = [
    conv
    relu
    maxPoolDownsample2x
    conv
    relu
    maxPoolDownsample2x
    ]
downsamplingLayers = 
  6x1 Layer array with layers:

     1   ''   Convolution   32 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
     2   ''   ReLU          ReLU
     3   ''   Max Pooling   2x2 max pooling with stride [2  2] and padding [0  0  0  0]
     4   ''   Convolution   32 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
     5   ''   ReLU          ReLU
     6   ''   Max Pooling   2x2 max pooling with stride [2  2] and padding [0  0  0  0]
Create Upsampling Network
The upsampling is done using a transposed convolution layer (also commonly referred to as a "deconv" or "deconvolution" layer). When a transposed convolution is used for upsampling, it performs the upsampling and the filtering at the same time.
Create a transposed convolution layer to upsample by 2.
filterSize = 4;
transposedConvUpsample2x = transposedConv2dLayer(filterSize,numFilters,'Stride',2,'Cropping',1);
The 'Cropping' parameter is set to 1 so that the output size is twice the input size.
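You can verify this with the transposed convolution output-size formula, stride*(inputSize - 1) + filterSize - 2*cropping:
% For a 16-pixel spatial dimension: 2*(16-1) + 4 - 2*1 = 32, i.e. 2x.
dim = 16;
upsampledDim = 2*(dim - 1) + filterSize - 2*1   % returns 32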
Stack the transposed convolution and ReLU layers. The input to this set of layers is upsampled by a factor of 4.
upsamplingLayers = [
    transposedConvUpsample2x
    relu
    transposedConvUpsample2x
    relu
    ]
upsamplingLayers = 
  4x1 Layer array with layers:

     1   ''   Transposed Convolution   32 4x4 transposed convolutions with stride [2  2] and output cropping [1  1]
     2   ''   ReLU                     ReLU
     3   ''   Transposed Convolution   32 4x4 transposed convolutions with stride [2  2] and output cropping [1  1]
     4   ''   ReLU                     ReLU
Create A Pixel Classification Layer
The final set of layers is responsible for making pixel classifications. These final layers process an input that has the same spatial dimensions (height and width) as the input image. However, the number of channels (the third dimension) is larger, equal to the number of filters in the last transposed convolution layer. This third dimension needs to be squeezed down to the number of classes we wish to segment. This can be done using a 1-by-1 convolution layer whose number of filters equals the number of classes, for example, 3.
Create a convolution layer to combine the third dimension of the input feature maps down to the number of classes.
numClasses = 3;
conv1x1 = convolution2dLayer(1,numClasses);
Following this 1-by-1 convolution layer are the softmax and pixel classification layers. These two layers combine to predict the categorical label for each image pixel.
finalLayers = [
    conv1x1
    softmaxLayer()
    pixelClassificationLayer()
    ]
finalLayers = 
  3x1 Layer array with layers:

     1   ''   Convolution                  3 1x1 convolutions with stride [1  1] and padding [0  0  0  0]
     2   ''   Softmax                      softmax
     3   ''   Pixel Classification Layer   Cross-entropy loss
Stack All Layers
Stack all the layers to complete the semantic segmentation network.
net = [
    imgLayer
    downsamplingLayers
    upsamplingLayers
    finalLayers
    ]
net = 
  14x1 Layer array with layers:

     1   ''   Image Input                  32x32x3 images with 'zerocenter' normalization
     2   ''   Convolution                  32 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
     3   ''   ReLU                         ReLU
     4   ''   Max Pooling                  2x2 max pooling with stride [2  2] and padding [0  0  0  0]
     5   ''   Convolution                  32 3x3 convolutions with stride [1  1] and padding [1  1  1  1]
     6   ''   ReLU                         ReLU
     7   ''   Max Pooling                  2x2 max pooling with stride [2  2] and padding [0  0  0  0]
     8   ''   Transposed Convolution       32 4x4 transposed convolutions with stride [2  2] and output cropping [1  1]
     9   ''   ReLU                         ReLU
    10   ''   Transposed Convolution       32 4x4 transposed convolutions with stride [2  2] and output cropping [1  1]
    11   ''   ReLU                         ReLU
    12   ''   Convolution                  3 1x1 convolutions with stride [1  1] and padding [0  0  0  0]
    13   ''   Softmax                      softmax
    14   ''   Pixel Classification Layer   Cross-entropy loss
This network is ready to be trained using trainNetwork from Deep Learning Toolbox™.
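Before training, you can optionally inspect the architecture with analyzeNetwork from Deep Learning Toolbox, which reports the activation size at every layer and flags problems such as mismatched input and output sizes:
% Optional: visualize the layers and per-layer activation sizes.
analyzeNetwork(net)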
Train A Semantic Segmentation Network
Load the training data.
dataSetDir = fullfile(toolboxdir('vision'),'visiondata','triangleImages');
imageDir = fullfile(dataSetDir,'trainingImages');
labelDir = fullfile(dataSetDir,'trainingLabels');
Create an image datastore for the images.
imds = imageDatastore(imageDir);
Create a pixelLabelDatastore for the ground truth pixel labels.
classNames = ["triangle","background"]; labelIDs = [255 0]; pxds = pixelLabelDatastore(labelDir,classNames,labelIDs);
Visualize training images and ground truth pixel labels.
I = read(imds);
C = read(pxds);
I = imresize(I,5);
L = imresize(uint8(C{1}),5);
imshowpair(I,L,'montage')
Create a semantic segmentation network. This example uses a simple network based on a downsampling and upsampling design.
numFilters = 64;
filterSize = 3;
numClasses = 2;
layers = [
    imageInputLayer([32 32 1])
    convolution2dLayer(filterSize,numFilters,'Padding',1)
    reluLayer()
    maxPooling2dLayer(2,'Stride',2)
    convolution2dLayer(filterSize,numFilters,'Padding',1)
    reluLayer()
    transposedConv2dLayer(4,numFilters,'Stride',2,'Cropping',1)
    convolution2dLayer(1,numClasses)
    softmaxLayer()
    pixelClassificationLayer()
    ];
Set up the training options.
opts = trainingOptions('sgdm', ...
    'InitialLearnRate',1e-3, ...
    'MaxEpochs',100, ...
    'MiniBatchSize',64);
Combine the image and pixel label datastore for training.
trainingData = combine(imds,pxds);
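As an optional sanity check (a quick sketch, not part of the original steps), each read from the combined datastore returns a 1-by-2 cell array holding an image and its corresponding pixel label image:
% Preview one training sample; preview does not advance the datastore.
sample = preview(trainingData);
size(sample{1})   % grayscale training image, 32x32
size(sample{2})   % categorical pixel label image, 32x32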
Train the network.
net = trainNetwork(trainingData,layers,opts);
Training on single CPU.
Initializing input data normalization.
|========================================================================================|
|  Epoch  |  Iteration  |  Time Elapsed  |  Mini-batch  |  Mini-batch  |  Base Learning  |
|         |             |   (hh:mm:ss)   |   Accuracy   |     Loss     |      Rate       |
|========================================================================================|
|       1 |           1 |       00:00:00 |       58.11% |       1.3458 |          0.0010 |
|      17 |          50 |       00:00:12 |       97.30% |       0.0924 |          0.0010 |
|      34 |         100 |       00:00:24 |       98.09% |       0.0575 |          0.0010 |
|      50 |         150 |       00:00:37 |       98.56% |       0.0424 |          0.0010 |
|      67 |         200 |       00:00:49 |       98.48% |       0.0435 |          0.0010 |
|      84 |         250 |       00:01:02 |       98.66% |       0.0363 |          0.0010 |
|     100 |         300 |       00:01:14 |       98.90% |       0.0310 |          0.0010 |
|========================================================================================|
Training finished: Reached final iteration.
Read and display a test image.
testImage = imread('triangleTest.jpg');
imshow(testImage)
Segment the test image and display the results.
C = semanticseg(testImage,net);
B = labeloverlay(testImage,C);
imshow(B)
Evaluate and Inspect the Results of Semantic Segmentation
Import a test data set, run a pretrained semantic segmentation network, and evaluate and inspect semantic segmentation quality metrics for the predicted results.
Import a Data Set
The triangleImages data set has 100 test images with ground truth labels. Define the location of the data set.
dataSetDir = fullfile(toolboxdir('vision'),'visiondata','triangleImages');
Define the location of the test images.
testImagesDir = fullfile(dataSetDir,'testImages');
Create an imageDatastore object holding the test images.
imds = imageDatastore(testImagesDir);
Define the location of the ground truth labels.
testLabelsDir = fullfile(dataSetDir,'testLabels');
Define the class names and their associated label IDs. The label IDs are the pixel values used in the image files to represent each class.
classNames = ["triangle" "background"]; labelIDs = [255 0];
Create a pixelLabelDatastore object holding the ground truth pixel labels for the test images.
pxdsTruth = pixelLabelDatastore(testLabelsDir,classNames,labelIDs);
Run a Semantic Segmentation Classifier
Load a semantic segmentation network that has been trained on the training images of triangleImages.
net = load('triangleSegmentationNetwork.mat');
net = net.net;
Run the network on the test images. Predicted labels are written to disk in a temporary directory and returned as a pixelLabelDatastore object.
pxdsResults = semanticseg(imds,net,"WriteLocation",tempdir);
Running semantic segmentation network
-------------------------------------
* Processed 100 images.
Evaluate the Quality of the Prediction
The predicted labels are compared to the ground truth labels. While the semantic segmentation metrics are being computed, progress is printed to the Command Window.
metrics = evaluateSemanticSegmentation(pxdsResults,pxdsTruth);
Evaluating semantic segmentation results
----------------------------------------
* Selected metrics: global accuracy, class accuracy, IoU, weighted IoU, BF score.
* Processed 100 images.
* Finalizing... Done.
* Data set metrics:

    GlobalAccuracy    MeanAccuracy    MeanIoU    WeightedIoU    MeanBFScore
    ______________    ____________    _______    ___________    ___________

       0.90624          0.95085       0.61588      0.87529        0.40652
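The same aggregate measurements shown above are also available programmatically, as a table stored on the metrics object:
% Aggregate metrics over the entire test set, as a table.
metrics.DataSetMetrics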
Inspect Class Metrics
Display the classification accuracy, the intersection over union (IoU), and the boundary F1 score for each class in the data set.
metrics.ClassMetrics
ans=2×3 table
                  Accuracy      IoU      MeanBFScore
                  ________    _______    ___________

    triangle           1      0.33005      0.028664
    background    0.9017       0.9017       0.78438
Display the Confusion Matrix
Display the confusion matrix.
metrics.ConfusionMatrix
ans=2×2 table
                  triangle    background
                  ________    __________

    triangle        4730            0
    background      9601        88069
Visualize the normalized confusion matrix as a confusion chart in a figure window.
cm = confusionchart(metrics.ConfusionMatrix.Variables, ...
    classNames,Normalization='row-normalized');
cm.Title = 'Normalized Confusion Matrix (%)';
Inspect an Image Metric
Visualize the histogram of the per-image intersection over union (IoU).
imageIoU = metrics.ImageMetrics.MeanIoU;
figure
histogram(imageIoU)
title('Image Mean IoU')
Find the test image with the lowest IoU.
[minIoU, worstImageIndex] = min(imageIoU);
minIoU = minIoU(1);
worstImageIndex = worstImageIndex(1);
Read the test image with the worst IoU, its ground truth labels, and its predicted labels for comparison.
worstTestImage = readimage(imds,worstImageIndex);
worstTrueLabels = readimage(pxdsTruth,worstImageIndex);
worstPredictedLabels = readimage(pxdsResults,worstImageIndex);
Convert the label images to images that can be displayed in a figure window.
worstTrueLabelImage = im2uint8(worstTrueLabels == classNames(1));
worstPredictedLabelImage = im2uint8(worstPredictedLabels == classNames(1));
Display the worst test image, the ground truth, and the prediction.
worstMontage = cat(4,worstTestImage,worstTrueLabelImage,worstPredictedLabelImage);
worstMontage = imresize(worstMontage,4,"nearest");
figure
montage(worstMontage,'Size',[1 3])
title(['Test Image vs. Truth vs. Prediction. IoU = ' num2str(minIoU)])
Similarly, find the test image with the highest IoU.
[maxIoU, bestImageIndex] = max(imageIoU);
maxIoU = maxIoU(1);
bestImageIndex = bestImageIndex(1);
Repeat the previous steps to read, convert, and display the test image with the best IoU with its ground truth and predicted labels.
bestTestImage = readimage(imds,bestImageIndex);
bestTrueLabels = readimage(pxdsTruth,bestImageIndex);
bestPredictedLabels = readimage(pxdsResults,bestImageIndex);

bestTrueLabelImage = im2uint8(bestTrueLabels == classNames(1));
bestPredictedLabelImage = im2uint8(bestPredictedLabels == classNames(1));

bestMontage = cat(4,bestTestImage,bestTrueLabelImage,bestPredictedLabelImage);
bestMontage = imresize(bestMontage,4,"nearest");
figure
montage(bestMontage,'Size',[1 3])
title(['Test Image vs. Truth vs. Prediction. IoU = ' num2str(maxIoU)])
Specify Metrics to Evaluate
Optionally, list the metrics you would like to evaluate using the 'Metrics' parameter.
Define the metrics to compute.
evaluationMetrics = ["accuracy" "iou"];
Compute these metrics for the triangleImages test data set.
metrics = evaluateSemanticSegmentation(pxdsResults,pxdsTruth,"Metrics",evaluationMetrics);
Evaluating semantic segmentation results
----------------------------------------
* Selected metrics: class accuracy, IoU.
* Processed 100 images.
* Finalizing... Done.
* Data set metrics:

    MeanAccuracy    MeanIoU
    ____________    _______

      0.95085       0.61588
Display the chosen metrics for each class.
metrics.ClassMetrics
ans=2×2 table
                  Accuracy      IoU
                  ________    _______

    triangle           1      0.33005
    background    0.9017       0.9017
Import Pixel Labeled Dataset For Semantic Segmentation
This example shows you how to import a pixel labeled dataset for semantic segmentation networks.
A pixel labeled dataset is a collection of images and a corresponding set of ground truth pixel labels used for training semantic segmentation networks. There are many public datasets that provide annotated images with per-pixel labels. To illustrate the steps for importing these types of datasets, the example uses the CamVid dataset from the University of Cambridge [1].
The CamVid dataset is a collection of images containing street level views obtained while driving. The dataset provides pixel-level labels for 32 semantic classes including car, pedestrian, and road. The steps shown to import CamVid can be used to import other pixel labeled datasets.
Download CamVid Dataset
Download the CamVid image data from the following URLs:
imageURL = 'http://web4.cs.ucl.ac.uk/staff/g.brostow/MotionSegRecData/files/701_StillsRaw_full.zip';
labelURL = 'http://web4.cs.ucl.ac.uk/staff/g.brostow/MotionSegRecData/data/LabeledApproved_full.zip';

outputFolder = fullfile(tempdir,'CamVid');
imageDir = fullfile(outputFolder,'images');
labelDir = fullfile(outputFolder,'labels');

if ~exist(outputFolder,'dir')
    disp('Downloading 557 MB CamVid data set...');
    unzip(imageURL,imageDir);
    unzip(labelURL,labelDir);
end
Note: Download time of the data depends on your internet connection. The commands used above will block MATLAB® until the download is complete. Alternatively, you can use your web browser to first download the dataset to your local disk. To use the file you downloaded from the web, change the outputFolder variable above to the location of the downloaded file.
CamVid Pixel Labels
The CamVid data set encodes the pixel labels as RGB images, where each class is represented by an RGB color. Here are the classes the dataset defines along with their RGB encodings.
classNames = [ ...
    "Animal", ...
    "Archway", ...
    "Bicyclist", ...
    "Bridge", ...
    "Building", ...
    "Car", ...
    "CartLuggagePram", ...
    "Child", ...
    "Column_Pole", ...
    "Fence", ...
    "LaneMkgsDriv", ...
    "LaneMkgsNonDriv", ...
    "Misc_Text", ...
    "MotorcycleScooter", ...
    "OtherMoving", ...
    "ParkingBlock", ...
    "Pedestrian", ...
    "Road", ...
    "RoadShoulder", ...
    "Sidewalk", ...
    "SignSymbol", ...
    "Sky", ...
    "SUVPickupTruck", ...
    "TrafficCone", ...
    "TrafficLight", ...
    "Train", ...
    "Tree", ...
    "Truck_Bus", ...
    "Tunnel", ...
    "VegetationMisc", ...
    "Wall"];
Define the mapping between label indices and class names such that classNames(k) corresponds to labelIDs(k,:).
labelIDs = [ ...
    064 128 064; ... % "Animal"
    192 000 128; ... % "Archway"
    000 128 192; ... % "Bicyclist"
    000 128 064; ... % "Bridge"
    128 000 000; ... % "Building"
    064 000 128; ... % "Car"
    064 000 192; ... % "CartLuggagePram"
    192 128 064; ... % "Child"
    192 192 128; ... % "Column_Pole"
    064 064 128; ... % "Fence"
    128 000 192; ... % "LaneMkgsDriv"
    192 000 064; ... % "LaneMkgsNonDriv"
    128 128 064; ... % "Misc_Text"
    192 000 192; ... % "MotorcycleScooter"
    128 064 064; ... % "OtherMoving"
    064 192 128; ... % "ParkingBlock"
    064 064 000; ... % "Pedestrian"
    128 064 128; ... % "Road"
    128 128 192; ... % "RoadShoulder"
    000 000 192; ... % "Sidewalk"
    192 128 128; ... % "SignSymbol"
    128 128 128; ... % "Sky"
    064 128 192; ... % "SUVPickupTruck"
    000 000 064; ... % "TrafficCone"
    000 064 064; ... % "TrafficLight"
    192 064 128; ... % "Train"
    128 128 000; ... % "Tree"
    192 128 192; ... % "Truck_Bus"
    064 000 064; ... % "Tunnel"
    192 192 000; ... % "VegetationMisc"
    064 192 000];    % "Wall"
Note that other datasets encode their labels in different formats. For example, the PASCAL VOC [2] dataset uses numeric label IDs between 0 and 21 to encode class labels.
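The import workflow is otherwise identical; only the label ID definition changes. Here is a minimal sketch for a dataset with scalar numeric IDs (the folder name and the truncated class list are hypothetical placeholders, not part of this example):
% Hypothetical import of a PASCAL VOC-style label folder:
% vocClassNames = ["background" "aeroplane" "bicycle"];  % first few classes only
% vocLabelIDs   = [0 1 2];              % scalar label IDs instead of RGB rows
% pxdsVOC = pixelLabelDatastore(vocLabelDir,vocClassNames,vocLabelIDs);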
Visualize the pixel labels for one of the CamVid images.
labels = imread(fullfile(labelDir,'0001TP_006690_L.png'));
figure
imshow(labels)

% Add a colorbar to show the class-to-color mapping.
N = numel(classNames);
ticks = 1/(N*2):1/N:1;
colorbar('TickLabels',cellstr(classNames),'Ticks',ticks,'TickLength',0,'TickLabelInterpreter','none');
colormap(labelIDs./255)
Load CamVid Data
A pixel labeled dataset can be loaded using an imageDatastore and a pixelLabelDatastore.
Create an imageDatastore to load the CamVid images.
imds = imageDatastore(fullfile(imageDir,'701_StillsRaw_full'));
Create a pixelLabelDatastore to load the CamVid pixel labels.
pxds = pixelLabelDatastore(labelDir,classNames,labelIDs);
Read the 10th image and corresponding pixel label image.
I = readimage(imds,10);
C = readimage(pxds,10);
The pixel label image is returned as a categorical array where C(i,j) is the categorical label assigned to pixel I(i,j). Display the pixel label image on top of the image.
B = labeloverlay(I,C,'Colormap',labelIDs./255);
figure
imshow(B)

% Add a colorbar.
N = numel(classNames);
ticks = 1/(N*2):1/N:1;
colorbar('TickLabels',cellstr(classNames),'Ticks',ticks,'TickLength',0,'TickLabelInterpreter','none');
colormap(labelIDs./255)
Undefined or Void Labels
It is common for pixel labeled datasets to include "undefined" or "void" labels. These are used to designate pixels that were not labeled. For example, in CamVid, the label ID [0 0 0] is used to designate the "void" class. Training algorithms and evaluation algorithms are not expected to include these labels in any computations.
The "void" class need not be explicitly named when using pixelLabelDatastore
. Any label ID that is not mapped to a class name is automatically labeled "undefined" and is excluded from computations. To see the undefined pixels, use isundefined
to create a mask and then display it on top of the image.
undefinedPixels = isundefined(C);
B = labeloverlay(I,undefinedPixels);
figure
imshow(B)
title('Undefined Pixel Labels')
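You can also quantify how much of the image is unlabeled:
% Fraction of pixels with no class assigned.
fractionUndefined = nnz(undefinedPixels)/numel(undefinedPixels)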
Combine Classes
When working with public datasets, you may need to combine some of the classes to better suit your application. For example, you may want to train a semantic segmentation network that segments a scene into five classes: road, sky, vehicle, pedestrian, and background. To do this with the CamVid dataset, group the label IDs defined above to fit the new classes. First, define the new class names.
newClassNames = ["road","sky","vehicle","pedestrian","background"];
Next, group label IDs using a cell array of M-by-3 matrices.
groupedLabelIDs = {
    % "road"
    [
    128 064 128; ... % "Road"
    128 000 192; ... % "LaneMkgsDriv"
    192 000 064; ... % "LaneMkgsNonDriv"
    000 000 192; ... % "Sidewalk"
    064 192 128; ... % "ParkingBlock"
    128 128 192; ... % "RoadShoulder"
    ]

    % "sky"
    [
    128 128 128; ... % "Sky"
    ]

    % "vehicle"
    [
    064 000 128; ... % "Car"
    064 128 192; ... % "SUVPickupTruck"
    192 128 192; ... % "Truck_Bus"
    192 064 128; ... % "Train"
    000 128 192; ... % "Bicyclist"
    192 000 192; ... % "MotorcycleScooter"
    128 064 064; ... % "OtherMoving"
    ]

    % "pedestrian"
    [
    064 064 000; ... % "Pedestrian"
    192 128 064; ... % "Child"
    064 000 192; ... % "CartLuggagePram"
    064 128 064; ... % "Animal"
    ]

    % "background"
    [
    128 128 000; ... % "Tree"
    192 192 000; ... % "VegetationMisc"
    192 128 128; ... % "SignSymbol"
    128 128 064; ... % "Misc_Text"
    000 064 064; ... % "TrafficLight"
    064 064 128; ... % "Fence"
    192 192 128; ... % "Column_Pole"
    000 000 064; ... % "TrafficCone"
    000 128 064; ... % "Bridge"
    128 000 000; ... % "Building"
    064 192 000; ... % "Wall"
    064 000 064; ... % "Tunnel"
    192 000 128; ... % "Archway"
    ]
    };
Create a pixelLabelDatastore using the new class and label IDs.
pxds = pixelLabelDatastore(labelDir,newClassNames,groupedLabelIDs);
Read the 10th pixel label image and display it on top of the image.
C = readimage(pxds,10);
cmap = jet(numel(newClassNames));
B = labeloverlay(I,C,'Colormap',cmap);
figure
imshow(B)

% Add a colorbar.
N = numel(newClassNames);
ticks = 1/(N*2):1/N:1;
colorbar('TickLabels',cellstr(newClassNames),'Ticks',ticks,'TickLength',0,'TickLabelInterpreter','none');
colormap(cmap)
The pixelLabelDatastore with the new class names can now be used to train a network for the five classes without having to modify the original CamVid pixel labels.
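Before training, it can also be useful to check the pixel balance of the grouped classes. One common pattern (a sketch, not part of the original steps) derives inverse-frequency class weights from countEachLabel; these can be passed to pixelClassificationLayer through its 'ClassWeights' parameter to counter class imbalance:
% Tabulate per-class pixel counts across the entire datastore.
tbl = countEachLabel(pxds);
% Weight each class by the inverse of its pixel frequency.
imageFreq = tbl.PixelCount ./ tbl.ImagePixelCount;
classWeights = median(imageFreq) ./ imageFreq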
References
[1] Brostow, Gabriel J., Julien Fauqueur, and Roberto Cipolla. "Semantic object classes in video: A high-definition ground truth database." Pattern Recognition Letters 30.2 (2009): 88-97.
[2] Everingham, M., et al. "The PASCAL Visual Object Classes Challenge 2012 Results." http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html. Vol. 5. 2012.