Deploy Quantized Network Example

This example shows how to train, compile, and deploy a dlhdl.Workflow object that has quantized AlexNet as the network object by using the Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC. Quantization helps reduce the memory requirement of a deep neural network by quantizing the weights, biases, and activations of network layers to 8-bit scaled integer data types. Use MATLAB® to retrieve the prediction results from the target device.
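As a rough illustration of what 8-bit scaled integer quantization does (shown here in Python rather than MATLAB, and not the toolbox's internal scheme), a symmetric per-tensor quantizer maps the largest observed magnitude to the int8 limit and keeps one floating-point scale for dequantization:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization sketch: derive a scale
    from the observed dynamic range, then round and saturate."""
    scale = np.max(np.abs(x)) / 127.0   # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from int8 codes and the scale."""
    return q.astype(np.float32) * scale

weights = np.array([0.42, -1.3, 0.07, 0.9], dtype=np.float32)
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
print(q)          # int8 codes
print(recovered)  # close to the original float values
```

Each value is stored in one byte instead of four, at the cost of a rounding error bounded by half the scale.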

Required Products

For this example, you need:

  • Deep Learning Toolbox™

  • Deep Learning HDL Toolbox™

  • Deep Learning Toolbox Model Quantization Library

  • Deep Learning HDL Toolbox Support Package for Xilinx FPGA and SoC Devices

  • MATLAB Coder Interface for Deep Learning Libraries

Load Pretrained SeriesNetwork

To load the pretrained series network AlexNet, enter:

snet = alexnet;

To view the layers of the pretrained series network, enter:

analyzeNetwork(snet);

The first layer, the image input layer, requires input images of size 227-by-227-by-3, where 3 is the number of color channels.

inputSize = snet.Layers(1).InputSize;

inputSize = 1×3

227 227 3

Define Training and Validation Data Sets

This example uses the logos_dataset data set, which consists of 320 images. Create an imageDatastore object and split it into training and validation data sets.

curDir = pwd;
newDir = fullfile(matlabroot,'examples','deeplearning_shared','data','logos_dataset.zip');
copyfile(newDir,curDir,'f');

unzip('logos_dataset.zip');

imds = imageDatastore('logos_dataset', ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');

[imdsTrain,imdsValidation] = splitEachLabel(imds,0.7,'randomized');

Replace Final Layers

The last three layers of the pretrained network snet are configured for 1000 classes. These three layers must be fine-tuned for the new classification problem. Extract all the layers, except the last three layers, from the pretrained network.

layersTransfer = snet.Layers(1:end-3);

Transfer the layers to the new classification task by replacing the last three layers with a fully connected layer, a softmax layer, and a classification output layer. Set the fully connected layer to have the same size as the number of classes in the new data.

numClasses = numel(categories(imdsTrain.Labels))

numClasses = 32

layers = [
    layersTransfer
    fullyConnectedLayer(numClasses,'WeightLearnRateFactor',20,'BiasLearnRateFactor',20)
    softmaxLayer
    classificationLayer];

Train Network

The network requires input images of size 227-by-227-by-3, but the images in the image datastores have different sizes. Use an augmented image datastore to automatically resize the training images. Specify additional augmentation operations to perform on the training images, such as randomly flipping the training images along the vertical axis and randomly translating them up to 30 pixels horizontally and vertically. Data augmentation helps prevent the network from overfitting and memorizing the exact details of the training images.

pixelRange = [-30 30];
imageAugmenter = imageDataAugmenter( ...
    'RandXReflection',true, ...
    'RandXTranslation',pixelRange, ...
    'RandYTranslation',pixelRange);
augimdsTrain = augmentedImageDatastore(inputSize(1:2),imdsTrain, ...
    'DataAugmentation',imageAugmenter);

To automatically resize the validation images without performing further data augmentation, use an augmented image datastore without specifying any additional preprocessing operations.

augimdsValidation = augmentedImageDatastore(inputSize(1:2),imdsValidation);

Specify the training options. For transfer learning, keep the features from the early layers of the pretrained network (the transferred layer weights). To slow down learning in the transferred layers, set the initial learning rate to a small value. Specify the mini-batch size and validation data. The software validates the network every ValidationFrequency iterations during training.

options = trainingOptions('sgdm', ...
    'MiniBatchSize',10, ...
    'MaxEpochs',6, ...
    'InitialLearnRate',1e-4, ...
    'Shuffle','every-epoch', ...
    'ValidationData',augimdsValidation, ...
    'ValidationFrequency',3, ...
    'Verbose',false, ...
    'Plots','training-progress');

Train the network that consists of the transferred and new layers. By default, trainNetwork uses a GPU if one is available (requires Parallel Computing Toolbox™ and a supported GPU device; for more information, see GPU Support by Release (Parallel Computing Toolbox)). Otherwise, trainNetwork uses the CPU (requires MATLAB Coder Interface for Deep Learning Libraries™). You can also specify the execution environment by using the 'ExecutionEnvironment' name-value argument of trainingOptions.

netTransfer = trainNetwork(augimdsTrain,layers,options);

Create dlquantizer Object

Create a dlquantizer object and specify the network to quantize. Specify the execution environment as FPGA.

dlQuantObj = dlquantizer(netTransfer,'ExecutionEnvironment','FPGA');

Calibrate Quantized Network

The dlquantizer object uses calibration data to collect dynamic ranges for the learnable parameters of the convolution and fully connected layers of the network.

For best quantization results, the calibration data must be representative of actual inputs to the network. Expedite the calibration process by reducing the calibration data set to 20 images.

imageData = imageDatastore(fullfile(curDir,'logos_dataset'), ...
    'IncludeSubfolders',true,'FileExtensions','.JPG','LabelSource','foldernames');
imageData_reduced = imageData.subset(1:20);
dlQuantObj.calibrate(imageData_reduced)
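Conceptually, calibration amounts to running representative inputs through the network and recording the observed dynamic range of each tensor, from which a quantization scale can be derived. The sketch below (in Python, using a hypothetical RangeCollector class; it does not reproduce the dlquantizer internals) illustrates the idea:

```python
import numpy as np

class RangeCollector:
    """Toy range calibrator: observe tensors produced by representative
    inputs and track the running min/max per tensor name."""
    def __init__(self):
        self.ranges = {}

    def observe(self, name, tensor):
        lo, hi = float(np.min(tensor)), float(np.max(tensor))
        if name in self.ranges:
            old_lo, old_hi = self.ranges[name]
            lo, hi = min(lo, old_lo), max(hi, old_hi)
        self.ranges[name] = (lo, hi)

    def int8_scale(self, name):
        """Symmetric scale that maps the widest observed magnitude to 127."""
        lo, hi = self.ranges[name]
        return max(abs(lo), abs(hi)) / 127.0

collector = RangeCollector()
# Two calibration batches for one (hypothetical) layer output:
for batch in (np.array([-0.5, 2.0]), np.array([1.0, -3.0])):
    collector.observe('conv1_out', batch)
print(collector.ranges['conv1_out'])
print(collector.int8_scale('conv1_out'))
```

A small but representative calibration set is usually enough to capture these ranges, which is why the example reduces the data set to 20 images.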

Create Target Object

Create a target object with a custom name for your target device and an interface to connect your target device to the host computer. Interface options are JTAG and Ethernet. To use the JTAG interface, install Xilinx™ Vivado™ Design Suite 2020.1. To set the Xilinx Vivado tool path, enter:

% hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2020.1\bin\vivado.bat');

To create the target object, enter:

hTarget = dlhdl.Target('Xilinx','Interface','Ethernet');

Alternatively, you can use the JTAG interface.

% hTarget = dlhdl.Target('Xilinx', 'Interface', 'JTAG');

Create Workflow Object

Create an object of the dlhdl.Workflow class. When you create the object, specify the network, the bitstream name, and the target information. Specify dlQuantObj as the network. Make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Xilinx ZCU102 SoC board and the bitstream uses the int8 data type.

hW = dlhdl.Workflow('Network',dlQuantObj,'Bitstream','zcu102_int8','Target',hTarget);

Compile the Quantized Series Network

To compile the quantized AlexNet series network, run the compile function of the dlhdl.Workflow object.

dn = hW.compile
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_int8 ...
### The network includes the following layers:

     1   'data'          Image Input                   227×227×3 images with 'zerocenter' normalization                                  (SW Layer)
     2   'conv1'         Convolution                   96 11×11×3 convolutions with stride [4  4] and padding [0  0  0  0]               (HW Layer)
     3   'relu1'         ReLU                          ReLU                                                                              (HW Layer)
     4   'norm1'         Cross Channel Normalization   cross channel normalization with 5 channels per element                           (HW Layer)
     5   'pool1'         Max Pooling                   3×3 max pooling with stride [2  2] and padding [0  0  0  0]                       (HW Layer)
     6   'conv2'         Grouped Convolution           2 groups of 128 5×5×48 convolutions with stride [1  1] and padding [2  2  2  2]   (HW Layer)
     7   'relu2'         ReLU                          ReLU                                                                              (HW Layer)
     8   'norm2'         Cross Channel Normalization   cross channel normalization with 5 channels per element                           (HW Layer)
     9   'pool2'         Max Pooling                   3×3 max pooling with stride [2  2] and padding [0  0  0  0]                       (HW Layer)
    10   'conv3'         Convolution                   384 3×3×256 convolutions with stride [1  1] and padding [1  1  1  1]              (HW Layer)
    11   'relu3'         ReLU                          ReLU                                                                              (HW Layer)
    12   'conv4'         Grouped Convolution           2 groups of 192 3×3×192 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    13   'relu4'         ReLU                          ReLU                                                                              (HW Layer)
    14   'conv5'         Grouped Convolution           2 groups of 128 3×3×192 convolutions with stride [1  1] and padding [1  1  1  1]  (HW Layer)
    15   'relu5'         ReLU                          ReLU                                                                              (HW Layer)
    16   'pool5'         Max Pooling                   3×3 max pooling with stride [2  2] and padding [0  0  0  0]                       (HW Layer)
    17   'fc6'           Fully Connected               4096 fully connected layer                                                        (HW Layer)
    18   'relu6'         ReLU                          ReLU                                                                              (HW Layer)
    19   'drop6'         Dropout                       50% dropout                                                                       (HW Layer)
    20   'fc7'           Fully Connected               4096 fully connected layer                                                        (HW Layer)
    21   'relu7'         ReLU                          ReLU                                                                              (HW Layer)
    22   'drop7'         Dropout                       50% dropout                                                                       (HW Layer)
    23   'fc'            Fully Connected               32 fully connected layer                                                          (HW Layer)
    24   'softmax'       Softmax                       softmax                                                                           (SW Layer)
    25   'classoutput'   Classification Output         crossentropyex with 'adidas' and 31 other classes                                 (SW Layer)

3 Memory Regions created.

Skipping: data
Compiling leg: conv1>>pool5 ...
Compiling leg: conv1>>pool5 ... complete.
Compiling leg: fc6>>fc ...
Compiling leg: fc6>>fc ... complete.
Skipping: softmax
Skipping: classoutput
Creating Schedule...
.........
Creating Schedule...complete.
Creating Status Table...
........
Creating Status Table...complete.
Emitting Schedule...
......
Emitting Schedule...complete.
Emitting Status Table...
..........
Emitting Status Table...complete.

### Allocating external memory buffers:

          offset_name          offset_address     allocated_space 
    _______________________    ______________    _________________

    "InputDataOffset"           "0x00000000"     "48.0 MB"        
    "OutputResultOffset"        "0x03000000"     "4.0 MB"         
    "SchedulerDataOffset"       "0x03400000"     "4.0 MB"         
    "SystemBufferOffset"        "0x03800000"     "28.0 MB"        
    "InstructionDataOffset"     "0x05400000"     "4.0 MB"         
    "ConvWeightDataOffset"      "0x05800000"     "8.0 MB"         
    "FCWeightDataOffset"        "0x06000000"     "56.0 MB"        
    "EndOffset"                 "0x09800000"     "Total: 152.0 MB"

### Network compilation complete.
dn = struct with fields:
             weights: [1×1 struct]
        instructions: [1×1 struct]
           registers: [1×1 struct]
    syncInstructions: [1×1 struct]

Program Bitstream onto FPGA and Download Network Weights

To deploy the network on the Xilinx ZCU102 SoC hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board with the programming file. It also downloads the network weights and biases. The deploy function starts programming the FPGA device, displays progress messages, and reports the time it takes to deploy the network.

hW.deploy
### FPGA bitstream programming has been skipped as the same bitstream is already loaded on the target FPGA.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 17-Dec-2020 11:06:56
### Loading weights to FC Processor.
### 33% finished, current time is 17-Dec-2020 11:06:57.
### 67% finished, current time is 17-Dec-2020 11:06:59.
### FC Weights loaded. Current time is 17-Dec-2020 11:06:59

Load Example Images and Run the Prediction

To load example images, run the predict function of the dlhdl.Workflow object, and then display the prediction results from the FPGA, enter:

idx = randperm(numel(imdsValidation.Files),4);
figure
for i = 1:4
    subplot(2,2,i)
    I = readimage(imdsValidation,idx(i));
    imshow(I)
    [prediction, speed] = hW.predict(single(I),'Profile','on');
    [val, index] = max(prediction);
    netTransfer.Layers(end).ClassNames{index}
    label = netTransfer.Layers(end).ClassNames{index}
    title(string(label));
end
### Finished writing input activations.
### Running single input activations.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                    9088267                  0.04131                       1            9088267             24.2
    conv1                   713071                  0.00324 
    norm1                   460546                  0.00209 
    pool1                    88791                  0.00040 
    conv2                   911059                  0.00414 
    norm2                   270230                  0.00123 
    pool2                    92782                  0.00042 
    conv3                   297066                  0.00135 
    conv4                   238155                  0.00108 
    conv5                   166248                  0.00076 
    pool5                    19576                  0.00009 
    fc6                    3955696                  0.01798 
    fc7                    1757863                  0.00799 
    fc                      117059                  0.00053 
 * The clock frequency of the DL processor is: 220MHz
ans = 
'ford'
label = 
'ford'
### Finished writing input activations.
### Running single input activations.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                    9088122                  0.04131                       1            9088122             24.2
    conv1                   713003                  0.00324 
    norm1                   460513                  0.00209 
    pool1                    89083                  0.00040 
    conv2                   910726                  0.00414 
    norm2                   270238                  0.00123 
    pool2                    92773                  0.00042 
    conv3                   297151                  0.00135 
    conv4                   238132                  0.00108 
    conv5                   166415                  0.00076 
    pool5                    19561                  0.00009 
    fc6                    3955517                  0.01798 
    fc7                    1757860                  0.00799 
    fc                      117054                  0.00053 
 * The clock frequency of the DL processor is: 220MHz
ans = 
'bmw'
label = 
'bmw'
### Finished writing input activations.
### Running single input activations.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                    9088305                  0.04131                       1            9088305             24.2
    conv1                   713031                  0.00324 
    norm1                   460263                  0.00209 
    pool1                    88948                  0.00040 
    conv2                   911216                  0.00414 
    norm2                   270247                  0.00123 
    pool2                    92514                  0.00042 
    conv3                   297124                  0.00135 
    conv4                   238252                  0.00108 
    conv5                   166320                  0.00076 
    pool5                    19519                  0.00009 
    fc6                    3955853                  0.01798 
    fc7                    1757867                  0.00799 
    fc                      117055                  0.00053 
 * The clock frequency of the DL processor is: 220MHz
ans = 
'aldi'
label = 
'aldi'
### Finished writing input activations.
### Running single input activations.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                    9088168                  0.04131                       1            9088168             24.2
    conv1                   713087                  0.00324 
    norm1                   460226                  0.00209 
    pool1                    89136                  0.00041 
    conv2                   910865                  0.00414 
    norm2                   270243                  0.00123 
    pool2                    92511                  0.00042 
    conv3                   297117                  0.00135 
    conv4                   238363                  0.00108 
    conv5                   166485                  0.00076 
    pool5                    19504                  0.00009 
    fc6                    3955608                  0.01798 
    fc7                    1757867                  0.00799 
    fc                      117060                  0.00053 
 * The clock frequency of the DL processor is: 220MHz
ans = 
'corona'
label = 
'corona'
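As a sanity check on the profiler output above, the reported latency in seconds and the frame rate follow directly from the cycle count and the 220 MHz deep learning processor clock:

```python
# Check: the profiler's latency in seconds and throughput follow from
# the cycle count and the 220 MHz deep learning processor clock.
clock_hz = 220e6
cycles = 9088267            # 'Network' LastFrameLatency(cycles) from the first run
latency_s = cycles / clock_hz
fps = 1.0 / latency_s
print(round(latency_s, 5))  # ~0.04131 s, matching LastFrameLatency(seconds)
print(round(fps, 1))        # ~24.2, matching the Frames/s column
```

The per-layer rows decompose this total: the two large fully connected layers (fc6 and fc7) account for most of the latency.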