
Custom Deep Learning Processor Generation to Meet Performance Requirements

This example shows how to create a custom processor configuration and estimate the performance of a pretrained series network on that configuration. You can then modify parameters of the custom processor configuration and re-estimate the performance. Once the estimate meets your performance requirements, you can generate a custom bitstream from the custom processor configuration.

Prerequisites

  • Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC

  • Deep Learning Toolbox™

  • Deep Learning HDL Toolbox™

  • Deep Learning Toolbox Model Quantization Library

  • MATLAB Coder Interface for Deep Learning

Load Pretrained Series Network

To load the pretrained series network LogoNet, enter:

snet = getLogoNetwork;

Define Training and Validation Data Sets

This example uses the logos_dataset data set. The data set consists of 320 images. Create an imageDatastore object from the unzipped folder, and then split it into training and validation sets, with 70% of the images of each label assigned to training.

curDir = pwd;
unzip('logos_dataset.zip');

imds = imageDatastore('logos_dataset', ...
    'IncludeSubfolders',true, ...
    'LabelSource','foldernames');

[imdsTrain,imdsValidation] = splitEachLabel(imds,0.7,'randomized');
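
As a quick sanity check on the split, the sketch below counts the images per label in each set by using countEachLabel. It assumes that imdsTrain and imdsValidation from the preceding code are in the workspace. The printed totals depend on how the images are distributed across labels, so no expected output is shown.

```matlab
% Count the images per label in each split. Assumes imdsTrain and
% imdsValidation were created by the splitEachLabel call above.
trainCounts = countEachLabel(imdsTrain);
valCounts   = countEachLabel(imdsValidation);

% Each result is a table with the variables Label and Count.
disp(trainCounts)
fprintf('Training images:   %d\n', sum(trainCounts.Count));
fprintf('Validation images: %d\n', sum(valCounts.Count));
```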

Create Custom Processor Configuration

To create a custom processor configuration, use the dlhdl.ProcessorConfig object. For more information, see dlhdl.ProcessorConfig. To learn about modifiable parameters of the processor configuration, see getModuleProperty and setModuleProperty.

hPC = dlhdl.ProcessorConfig;
hPC.TargetFrequency = 220;
hPC
hPC = 
                    Processing Module "conv"
                            ModuleGeneration: 'on'
                          LRNBlockGeneration: 'off'
                 SegmentationBlockGeneration: 'on'
                            ConvThreadNumber: 16
                             InputMemorySize: [227 227 3]
                            OutputMemorySize: [227 227 3]
                            FeatureSizeLimit: 2048

                      Processing Module "fc"
                            ModuleGeneration: 'on'
                      SoftmaxBlockGeneration: 'off'
                      SigmoidBlockGeneration: 'off'
                              FCThreadNumber: 4
                             InputMemorySize: 25088
                            OutputMemorySize: 4096

                  Processing Module "custom"
                            ModuleGeneration: 'on'
                                    Addition: 'on'
                              Multiplication: 'on'
                                    Resize2D: 'off'
                                     Sigmoid: 'off'
                                   TanhLayer: 'off'
                             InputMemorySize: 40
                            OutputMemorySize: 120

              Processor Top Level Properties
                              RunTimeControl: 'register'
                               RunTimeStatus: 'register'
                          InputStreamControl: 'register'
                         OutputStreamControl: 'register'
                                SetupControl: 'register'
                           ProcessorDataType: 'single'

                     System Level Properties
                              TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
                             TargetFrequency: 220
                               SynthesisTool: 'Xilinx Vivado'
                             ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
                     SynthesisToolChipFamily: 'Zynq UltraScale+'
                     SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
                    SynthesisToolPackageName: ''
                     SynthesisToolSpeedValue: ''
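
To read a single module parameter instead of displaying the whole configuration, you can query it with getModuleProperty. The following is a minimal sketch that assumes hPC is the configuration created above. The variable names convThreads and fcThreads are illustrative only.

```matlab
% Query individual module parameters of the processor configuration.
% Assumes hPC is the dlhdl.ProcessorConfig object created above.
convThreads = hPC.getModuleProperty('conv', 'ConvThreadNumber');
fcThreads   = hPC.getModuleProperty('fc', 'FCThreadNumber');

fprintf('ConvThreadNumber: %d\n', convThreads);
fprintf('FCThreadNumber:   %d\n', fcThreads);
```

For the configuration displayed above, these values correspond to the ConvThreadNumber and FCThreadNumber entries in the "conv" and "fc" modules.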

Estimate LogoNet Performance

To estimate the performance of the LogoNet series network, use the estimatePerformance function of the dlhdl.ProcessorConfig object. The function returns the estimated layer latency, network latency, and network performance in frames per second (Frames/s).

hPC.estimatePerformance(snet)
### Notice: The layer 'imageinput' of type 'ImageInputLayer' is split into an image input layer 'imageinput' and an addition layer 'imageinput_norm' for normalization on hardware.
### The network includes the following layers:
     1   'imageinput'    Image Input             227×227×3 images with 'zerocenter' normalization and 'randfliplr' augmentations  (SW Layer)
     2   'conv_1'        2-D Convolution         96 5×5×3 convolutions with stride [1  1] and padding [0  0  0  0]                (HW Layer)
     3   'relu_1'        ReLU                    ReLU                                                                             (HW Layer)
     4   'maxpool_1'     2-D Max Pooling         3×3 max pooling with stride [2  2] and padding [0  0  0  0]                      (HW Layer)
     5   'conv_2'        2-D Convolution         128 3×3×96 convolutions with stride [1  1] and padding [0  0  0  0]              (HW Layer)
     6   'relu_2'        ReLU                    ReLU                                                                             (HW Layer)
     7   'maxpool_2'     2-D Max Pooling         3×3 max pooling with stride [2  2] and padding [0  0  0  0]                      (HW Layer)
     8   'conv_3'        2-D Convolution         384 3×3×128 convolutions with stride [1  1] and padding [0  0  0  0]             (HW Layer)
     9   'relu_3'        ReLU                    ReLU                                                                             (HW Layer)
    10   'maxpool_3'     2-D Max Pooling         3×3 max pooling with stride [2  2] and padding [0  0  0  0]                      (HW Layer)
    11   'conv_4'        2-D Convolution         128 3×3×384 convolutions with stride [2  2] and padding [0  0  0  0]             (HW Layer)
    12   'relu_4'        ReLU                    ReLU                                                                             (HW Layer)
    13   'maxpool_4'     2-D Max Pooling         3×3 max pooling with stride [2  2] and padding [0  0  0  0]                      (HW Layer)
    14   'fc_1'          Fully Connected         2048 fully connected layer                                                       (HW Layer)
    15   'relu_5'        ReLU                    ReLU                                                                             (HW Layer)
    16   'fc_2'          Fully Connected         2048 fully connected layer                                                       (HW Layer)
    17   'relu_6'        ReLU                    ReLU                                                                             (HW Layer)
    18   'fc_3'          Fully Connected         32 fully connected layer                                                         (HW Layer)
    19   'softmax'       Softmax                 softmax                                                                          (SW Layer)
    20   'classoutput'   Classification Output   crossentropyex with 'adidas' and 31 other classes                                (SW Layer)
                                                                                                                                
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.


              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                   39199107                  0.17818                       1           39199107              5.6
    ____imageinput_norm     216472                  0.00098 
    ____conv_1             6832680                  0.03106 
    ____maxpool_1          3705912                  0.01685 
    ____conv_2            10454501                  0.04752 
    ____maxpool_2          1173810                  0.00534 
    ____conv_3             9364533                  0.04257 
    ____maxpool_3          1229970                  0.00559 
    ____conv_4             1759348                  0.00800 
    ____maxpool_4            24450                  0.00011 
    ____fc_1               2651288                  0.01205 
    ____fc_2               1696632                  0.00771 
    ____fc_3                 89511                  0.00041 
 * The clock frequency of the DL processor is: 220MHz

The estimated performance is 5.6 Frames/s. To improve the network performance, modify the custom processor convolution module kernel data type, convolution processor thread number, fully connected module kernel data type, and fully connected module thread number. For more information about these processor parameters, see getModuleProperty and setModuleProperty.
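
The Frames/s value in the results table follows from the network latency in cycles and the processor clock frequency. The sketch below reproduces that arithmetic with the numbers copied from the table above. The variable names are illustrative, and the calculation assumes one frame is processed at a time, as in the FramesNum column.

```matlab
% Values copied from the estimator output above.
latencyCycles = 39199107;   % LastFrameLatency(cycles), Network row
clockHz       = 220e6;      % DL processor clock frequency (220 MHz)

% Latency in seconds is cycles divided by clock frequency.
latencySeconds = latencyCycles / clockHz;

% With a single frame in flight, throughput is the inverse of latency.
framesPerSecond = 1 / latencySeconds;

fprintf('Latency: %.5f s, throughput: %.1f Frames/s\n', ...
    latencySeconds, framesPerSecond);
```

The computed latency rounds to 0.17818 seconds and the throughput to 5.6 Frames/s, matching the Network row of the table.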

Create Modified Custom Processor Configuration

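To create the modified custom processor configuration, create a new dlhdl.ProcessorConfig object and change its parameters. This configuration raises the target frequency from 220 MHz to 300 MHz, changes the processor data type from single to int8, and increases the convolution and fully connected thread numbers to 64 and 16, respectively. For more information, see dlhdl.ProcessorConfig. To learn about modifiable parameters of the processor configuration, see getModuleProperty and setModuleProperty.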

hPCNew = dlhdl.ProcessorConfig;
hPCNew.TargetFrequency = 300;
hPCNew.ProcessorDataType = 'int8';
hPCNew.setModuleProperty('conv', 'ConvThreadNumber', 64);
hPCNew.setModuleProperty('fc', 'FCThreadNumber',   16);
hPCNew
hPCNew = 
                    Processing Module "conv"
                            ModuleGeneration: 'on'
                          LRNBlockGeneration: 'off'
                 SegmentationBlockGeneration: 'on'
                            ConvThreadNumber: 64
                             InputMemorySize: [227 227 3]
                            OutputMemorySize: [227 227 3]
                            FeatureSizeLimit: 2048

                      Processing Module "fc"
                            ModuleGeneration: 'on'
                      SoftmaxBlockGeneration: 'off'
                      SigmoidBlockGeneration: 'off'
                              FCThreadNumber: 16
                             InputMemorySize: 25088
                            OutputMemorySize: 4096

                  Processing Module "custom"
                            ModuleGeneration: 'on'
                                    Addition: 'on'
                              Multiplication: 'on'
                                    Resize2D: 'off'
                                     Sigmoid: 'off'
                                   TanhLayer: 'off'
                             InputMemorySize: 40
                            OutputMemorySize: 120

              Processor Top Level Properties
                              RunTimeControl: 'register'
                               RunTimeStatus: 'register'
                          InputStreamControl: 'register'
                         OutputStreamControl: 'register'
                                SetupControl: 'register'
                           ProcessorDataType: 'int8'

                     System Level Properties
                              TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
                             TargetFrequency: 300
                               SynthesisTool: 'Xilinx Vivado'
                             ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
                     SynthesisToolChipFamily: 'Zynq UltraScale+'
                     SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
                    SynthesisToolPackageName: ''
                     SynthesisToolSpeedValue: ''

Quantize LogoNet Series Network

To quantize the LogoNet network, enter:

imageData = imageDatastore(fullfile(curDir,'logos_dataset'),...
 'IncludeSubfolders',true,'FileExtensions','.JPG','LabelSource','foldernames');
imageData_reduced = imageData.subset(1:20);
dlquantObj = dlquantizer(snet,'ExecutionEnvironment','FPGA');
dlquantObj.calibrate(imageData_reduced)
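
The calibrate function can also return its results as a table, which is convenient for inspecting the collected dynamic ranges. The sketch below assumes dlquantObj and imageData_reduced from the preceding code. The variable name calResults is illustrative. Note that this call runs calibration again.

```matlab
% Capture the calibration statistics in a table instead of only
% displaying them. Assumes dlquantObj and imageData_reduced exist.
calResults = dlquantObj.calibrate(imageData_reduced);

% Show the first few rows of the calibration results.
disp(head(calResults))
```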

Estimate LogoNet Performance

To estimate the performance of the quantized LogoNet series network on the modified configuration, pass the calibrated dlquantizer object to the estimatePerformance function of the dlhdl.ProcessorConfig object. The function returns the estimated layer latency, network latency, and network performance in frames per second (Frames/s).

hPCNew.estimatePerformance(dlquantObj)
### The network includes the following layers:
     1   'imageinput'    Image Input             227×227×3 images with 'zerocenter' normalization and 'randfliplr' augmentations  (SW Layer)
     2   'conv_1'        2-D Convolution         96 5×5×3 convolutions with stride [1  1] and padding [0  0  0  0]                (HW Layer)
     3   'relu_1'        ReLU                    ReLU                                                                             (HW Layer)
     4   'maxpool_1'     2-D Max Pooling         3×3 max pooling with stride [2  2] and padding [0  0  0  0]                      (HW Layer)
     5   'conv_2'        2-D Convolution         128 3×3×96 convolutions with stride [1  1] and padding [0  0  0  0]              (HW Layer)
     6   'relu_2'        ReLU                    ReLU                                                                             (HW Layer)
     7   'maxpool_2'     2-D Max Pooling         3×3 max pooling with stride [2  2] and padding [0  0  0  0]                      (HW Layer)
     8   'conv_3'        2-D Convolution         384 3×3×128 convolutions with stride [1  1] and padding [0  0  0  0]             (HW Layer)
     9   'relu_3'        ReLU                    ReLU                                                                             (HW Layer)
    10   'maxpool_3'     2-D Max Pooling         3×3 max pooling with stride [2  2] and padding [0  0  0  0]                      (HW Layer)
    11   'conv_4'        2-D Convolution         128 3×3×384 convolutions with stride [2  2] and padding [0  0  0  0]             (HW Layer)
    12   'relu_4'        ReLU                    ReLU                                                                             (HW Layer)
    13   'maxpool_4'     2-D Max Pooling         3×3 max pooling with stride [2  2] and padding [0  0  0  0]                      (HW Layer)
    14   'fc_1'          Fully Connected         2048 fully connected layer                                                       (HW Layer)
    15   'relu_5'        ReLU                    ReLU                                                                             (HW Layer)
    16   'fc_2'          Fully Connected         2048 fully connected layer                                                       (HW Layer)
    17   'relu_6'        ReLU                    ReLU                                                                             (HW Layer)
    18   'fc_3'          Fully Connected         32 fully connected layer                                                         (HW Layer)
    19   'softmax'       Softmax                 softmax                                                                          (SW Layer)
    20   'classoutput'   Classification Output   crossentropyex with 'adidas' and 31 other classes                                (SW Layer)
                                                                                                                                
### Notice: The layer 'imageinput' with type 'nnet.cnn.layer.ImageInputLayer' is implemented in software.
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'classoutput' with type 'nnet.cnn.layer.ClassificationOutputLayer' is implemented in software.


              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                   13829465                  0.04610                       1           13829465             21.7
    ____conv_1             3487680                  0.01163 
    ____maxpool_1          1852092                  0.00617 
    ____conv_2             2939191                  0.00980 
    ____maxpool_2           586689                  0.00196 
    ____conv_3             2577951                  0.00859 
    ____maxpool_3           614769                  0.00205 
    ____conv_4              611644                  0.00204 
    ____maxpool_4            12201                  0.00004 
    ____fc_1                665265                  0.00222 
    ____fc_2                425425                  0.00142 
    ____fc_3                 56558                  0.00019 
 * The clock frequency of the DL processor is: 300MHz

The estimated performance is 21.7 Frames/s, compared with 5.6 Frames/s for the initial single data type configuration at 220 MHz.
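
To see where the improvement comes from, the sketch below separates the overall speedup into the reduction in cycle count and the gain from the higher clock frequency. The numbers are copied from the Network rows of the two results tables. The variable names are illustrative.

```matlab
% Network-level results copied from the two estimator outputs.
baselineCycles = 39199107;   baselineClockHz = 220e6;   % single, 220 MHz
modifiedCycles = 13829465;   modifiedClockHz = 300e6;   % int8, 300 MHz

% Throughput for one frame at a time: clock frequency / cycles per frame.
baselineFps = baselineClockHz / baselineCycles;
modifiedFps = modifiedClockHz / modifiedCycles;

% Split the speedup into its two contributing factors.
cycleReduction = baselineCycles / modifiedCycles;    % fewer cycles per frame
clockGain      = modifiedClockHz / baselineClockHz;  % faster clock
speedup        = modifiedFps / baselineFps;          % equals the product

fprintf('Baseline: %.1f Frames/s, modified: %.1f Frames/s\n', ...
    baselineFps, modifiedFps);
fprintf('Cycle reduction: %.2fx, clock gain: %.2fx, speedup: %.2fx\n', ...
    cycleReduction, clockGain, speedup);
```

Most of the estimated gain comes from the lower cycle count per frame. The higher clock frequency contributes a smaller factor.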

Generate Custom Processor and Bitstream

Use the new custom processor configuration to build and generate a custom processor and bitstream. Use the custom bitstream to deploy the LogoNet network to your target FPGA board.

hdlsetuptoolpath('ToolName', 'Xilinx Vivado', 'ToolPath', 'C:\Xilinx\Vivado\2020.2\bin\vivado.bat');
dlhdl.buildProcessor(hPCNew);

To learn how to use the generated bitstream file, see Generate Custom Bitstream.

The generated bitstream in this example is similar to the zcu102_int8 bitstream. To deploy the quantized LogoNet network using the zcu102_int8 bitstream, see Classify Images on FPGA Using Quantized Neural Network.
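
The following is a minimal sketch of how the generated bitstream might be used with a dlhdl.Workflow object. It assumes that the generated bitstream file is named dlprocessor.bit and has been copied to the current folder together with its associated .mat file, that the board is connected over Ethernet, and that dlquantObj is the calibrated dlquantizer object from this example. The file name and the interface choice are assumptions to adapt to your setup.

```matlab
% Assumption: dlprocessor.bit (and its associated .mat file) generated by
% dlhdl.buildProcessor are in the current folder.
% Assumption: the ZCU102 board is reachable over Ethernet.
hTarget = dlhdl.Target('Xilinx', 'Interface', 'Ethernet');

% Pair the quantized network with the custom bitstream.
hW = dlhdl.Workflow('Network', dlquantObj, ...
    'Bitstream', 'dlprocessor.bit', ...
    'Target', hTarget);

% Compile the network for the custom processor, then program the board.
hW.compile;
hW.deploy;
```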
