Create a Semantic Segmentation Network
Create a simple semantic segmentation network and learn about common layers found in many semantic segmentation networks. A common pattern in semantic segmentation networks requires downsampling an image between convolution and ReLU layers, and then upsampling the output to match the input size. This operation is analogous to the standard scale-space analysis using image pyramids. During this process, however, a network performs the operations using nonlinear filters optimized for the specific set of classes that you want to segment.
Create Image Input Layer
A semantic segmentation network starts with an imageInputLayer, which defines the smallest image size the network can process. Most semantic segmentation networks are fully convolutional, which means they can process images that are larger than the specified input size. Here, an image size of [32 32 3] is used so that the network can process 64-by-64 RGB images.
inputSize = [32 32 3];
imgLayer = imageInputLayer(inputSize)
imgLayer = 
  ImageInputLayer with properties:

                      Name: ''
                 InputSize: [32 32 3]
        SplitComplexInputs: 0

   Hyperparameters
          DataAugmentation: 'none'
             Normalization: 'zerocenter'
    NormalizationDimension: 'auto'
                      Mean: []
Create Downsampling Network
Start with the convolution and ReLU layers. The convolution layer padding is selected such that the output size of the convolution layer is the same as the input size. This makes it easier to construct a network because the input and output sizes between most layers remain the same as you progress through the network.
filterSize = 3;
numFilters = 32;
conv = convolution2dLayer(filterSize,numFilters,Padding=1);
relu = reluLayer();
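As an illustrative check, the convolution output size is given by (H + 2*padding - filterSize)/stride + 1, so a padding of 1 with a 3-by-3 filter at stride 1 preserves the spatial size:

% Illustrative size check: (32 + 2*1 - 3)/1 + 1 = 32, so the
% convolution output is the same size as its input.
H = 32;
convH = (H + 2*1 - filterSize)/1 + 1   % returns 32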
The downsampling is performed using a max pooling layer. Create a max pooling layer to downsample the input by a factor of 2 by setting the 'Stride' parameter to 2.
poolSize = 2;
maxPoolDownsample2x = maxPooling2dLayer(poolSize,Stride=2);
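Similarly, the pooling output size is floor((H - poolSize)/stride) + 1, so this layer halves each spatial dimension. For example:

% Illustrative size check: floor((32 - 2)/2) + 1 = 16, half the
% input's spatial size.
H = 32;
pooledH = floor((H - poolSize)/2) + 1   % returns 16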
Stack the convolution, ReLU, and max pooling layers to create a network that downsamples its input by a factor of 4.
downsamplingLayers = [
    conv
    relu
    maxPoolDownsample2x
    conv
    relu
    maxPoolDownsample2x
    ]
downsamplingLayers = 
  6x1 Layer array with layers:

     1   ''   2-D Convolution   32 3x3 convolutions with stride [1 1] and padding [1 1 1 1]
     2   ''   ReLU              ReLU
     3   ''   2-D Max Pooling   2x2 max pooling with stride [2 2] and padding [0 0 0 0]
     4   ''   2-D Convolution   32 3x3 convolutions with stride [1 1] and padding [1 1 1 1]
     5   ''   ReLU              ReLU
     6   ''   2-D Max Pooling   2x2 max pooling with stride [2 2] and padding [0 0 0 0]
Create Upsampling Network
The upsampling is done using the transposed convolution layer (also commonly referred to as a "deconv" or "deconvolution" layer). When a transposed convolution is used for upsampling, it performs the upsampling and the filtering at the same time.
Create a transposed convolution layer to upsample by a factor of 2.
filterSize = 4;
transposedConvUpsample2x = transposedConv2dLayer(filterSize,numFilters,Stride=2,Cropping=1);
The 'Cropping' parameter is set to 1 to make the output size equal to twice the input size.
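As an illustrative check, the transposed convolution output size is stride*(H - 1) + filterSize - 2*cropping, so these settings double each spatial dimension:

% Illustrative size check: 2*(16 - 1) + 4 - 2*1 = 32, twice the
% input's spatial size.
H = 16;
upsampledH = 2*(H - 1) + filterSize - 2*1   % returns 32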
Stack the transposed convolution and ReLU layers. An input to this set of layers is upsampled by a factor of 4.
upsamplingLayers = [
    transposedConvUpsample2x
    relu
    transposedConvUpsample2x
    relu
    ]
upsamplingLayers = 
  4x1 Layer array with layers:

     1   ''   2-D Transposed Convolution   32 4x4 transposed convolutions with stride [2 2] and cropping [1 1 1 1]
     2   ''   ReLU                         ReLU
     3   ''   2-D Transposed Convolution   32 4x4 transposed convolutions with stride [2 2] and cropping [1 1 1 1]
     4   ''   ReLU                         ReLU
Create Final Layers for Pixel Classification
The final set of layers is responsible for making pixel classifications. These final layers process an input that has the same spatial dimensions (height and width) as the input image. However, the number of channels (the third dimension) is larger, equal to the number of filters in the last transposed convolution layer. This third dimension needs to be squeezed down to the number of classes you want to segment. This can be done using a 1-by-1 convolution layer whose number of filters equals the number of classes, for example, 3.
Create a 1-by-1 convolution layer to reduce the third dimension of the input feature maps to the number of classes.
numClasses = 3;
conv1x1 = convolution2dLayer(1,numClasses);
Following this 1-by-1 convolution layer is a softmax layer. This layer applies a softmax activation function that normalizes the output of the 1-by-1 convolution layer. The output of the softmax layer consists of positive numbers that sum to one for each pixel, which can be interpreted as classification probabilities.
finalLayers = [
    conv1x1
    softmaxLayer()
    ]
finalLayers = 
  2x1 Layer array with layers:

     1   ''   2-D Convolution   3 1x1 convolutions with stride [1 1] and padding [0 0 0 0]
     2   ''   Softmax           softmax
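To illustrate what the softmax output represents, the following sketch builds a placeholder score array (standing in for network output, not produced by this example) and converts it to a label map by taking the class with the largest probability at each pixel:

% Placeholder scores standing in for the H-by-W-by-numClasses softmax
% output of a trained network.
scores = rand(4,4,numClasses);
scores = scores ./ sum(scores,3);   % make each pixel's channels sum to 1
[~,labels] = max(scores,[],3)       % 4-by-4 map of class indices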
Stack All Layers
Stack all the layers to complete the semantic segmentation network.
net = [
    imgLayer
    downsamplingLayers
    upsamplingLayers
    finalLayers
    ]
net = 
  13x1 Layer array with layers:

     1   ''   Image Input                  32x32x3 images with 'zerocenter' normalization
     2   ''   2-D Convolution              32 3x3 convolutions with stride [1 1] and padding [1 1 1 1]
     3   ''   ReLU                         ReLU
     4   ''   2-D Max Pooling              2x2 max pooling with stride [2 2] and padding [0 0 0 0]
     5   ''   2-D Convolution              32 3x3 convolutions with stride [1 1] and padding [1 1 1 1]
     6   ''   ReLU                         ReLU
     7   ''   2-D Max Pooling              2x2 max pooling with stride [2 2] and padding [0 0 0 0]
     8   ''   2-D Transposed Convolution   32 4x4 transposed convolutions with stride [2 2] and cropping [1 1 1 1]
     9   ''   ReLU                         ReLU
    10   ''   2-D Transposed Convolution   32 4x4 transposed convolutions with stride [2 2] and cropping [1 1 1 1]
    11   ''   ReLU                         ReLU
    12   ''   2-D Convolution              3 1x1 convolutions with stride [1 1] and padding [0 0 0 0]
    13   ''   Softmax                      softmax
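As a quick shape check (a sketch, assuming Deep Learning Toolbox), you can convert the layer array to a dlnetwork and verify that a 32-by-32-by-3 input produces a 32-by-32-by-numClasses output:

% Convert the layer array to a dlnetwork and run a random input
% through it to confirm the output spatial size matches the input.
dlnet = dlnetwork(net);
X = dlarray(rand(inputSize,"single"),"SSC");
Y = predict(dlnet,X);
size(Y)   % expected: 32 32 3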
This network is ready to be trained using trainnet from Deep Learning Toolbox™.
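A minimal training call might look like the following sketch. Here, dsTrain is a hypothetical datastore that pairs images with pixel labels (for example, a combined imageDatastore and pixelLabelDatastore), and the training options are placeholders:

% Hypothetical training sketch; dsTrain and the option values are
% placeholders, not defined in this example.
options = trainingOptions("sgdm",MaxEpochs=10,MiniBatchSize=64);
trainedNet = trainnet(dsTrain,net,"crossentropy",options);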
See Also
trainnet (Deep Learning Toolbox)