Estimate Resource Utilization for Custom Processor Configuration

To estimate the resource utilization of a custom processor configuration, compare its estimated resource utilization to that of a reference (shipping) bitstream processor configuration. Then analyze how the custom deep learning processor parameters affect resource utilization.

Estimate Resource Utilization

Calculate resource utilization for a custom processor configuration.

  1. Create a dlhdl.ProcessorConfig object.

    hPC = dlhdl.ProcessorConfig
    hPC = 
    
                        Processing Module "conv"
                                ConvThreadNumber: 16
                                 InputMemorySize: [227  227    3]
                                OutputMemorySize: [227  227    3]
                                FeatureSizeLimit: 2048
                                  KernelDataType: 'single'
    
                          Processing Module "fc"
                                  FCThreadNumber: 4
                                 InputMemorySize: 25088
                                OutputMemorySize: 4096
                                  KernelDataType: 'single'
    
                       Processing Module "adder"
                                 InputMemorySize: 40
                                OutputMemorySize: 40
                                  KernelDataType: 'single'
    
                         System Level Properties
                                  TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
                                 TargetFrequency: 200
                                   SynthesisTool: 'Xilinx Vivado'
                                 ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
                         SynthesisToolChipFamily: 'Zynq UltraScale+'
                         SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
                        SynthesisToolPackageName: ''
                         SynthesisToolSpeedValue: ''
  2. Call estimateResources to retrieve resource utilization.

    hPC.estimateResources
    
                  Deep Learning Processor Estimator Resource Results
    
                                 DSPs          Block RAM*     
                            -------------    -------------    
    DL_Processor                     368              524 		 
        conv_module                  343              481 		 
        fc_module                     17               34 		 
        adder_module                   8                6 		 
        debug_module                   0                2 		 
        sched_module                   0                1 		 
     * Block RAM represents Block RAM tiles in Xilinx devices and Block RAM bits in Intel devices

    The returned table contains resource utilization for the entire processor and individual modules.
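As a quick sanity check on the table, the per-module rows should add up to the DL_Processor row. The following minimal MATLAB sketch re-enters the numbers from the output above by hand and confirms the totals. The variable names are illustrative and are not part of the toolbox.

```matlab
% Values copied by hand from the estimateResources output above.
% Rows: one per module. Columns: DSPs, Block RAM.
moduleNames = {'conv_module','fc_module','adder_module','debug_module','sched_module'};
moduleUse = [343 481;
              17  34;
               8   6;
               0   2;
               0   1];
reportedTotal = [368 524];              % DL_Processor row

computedTotal = sum(moduleUse, 1);      % column-wise sum over the modules
totalsAgree = isequal(computedTotal, reportedTotal);

% Percentage of each resource total that each module accounts for.
sharePct = 100 * moduleUse ./ reportedTotal;
for k = 1:numel(moduleNames)
    fprintf('%-14s DSP %5.1f%%  BRAM %5.1f%%\n', ...
        moduleNames{k}, sharePct(k,1), sharePct(k,2));
end
fprintf('Module rows sum to the reported total: %d\n', totalsAgree);
```

For this default configuration, the conv module accounts for most of both resources, which suggests that its parameters are the first place to look when resource use must be reduced.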

Customize Bitstream Configuration to Meet Resource Use Requirements

Suppose that you want to deploy a digit recognition network to a Xilinx ZCU102 ZU4CG device with a target performance of 500 frames per second (FPS). The target device resource counts are:

  • Digital signal processor (DSP) slice count - 240

  • Block random access memory (BRAM) count - 128

The reference (shipping) zcu102_int8 bitstream configuration is for a Xilinx ZCU102 ZU9EG device. The default board resource counts are:

  • Digital signal processor (DSP) slice count - 2520

  • Block random access memory (BRAM) count - 912

The default board resource counts exceed the resource budget, and the board is on the higher end of the cost spectrum. You can achieve the target performance and stay within the resource budget by quantizing the target deep learning network and customizing the default bitstream configuration.

In this example, create a custom bitstream configuration that matches your resource budget and performance requirements.

Prerequisites

  • Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC

  • Deep Learning Toolbox™

  • Deep Learning HDL Toolbox™

  • Deep Learning Toolbox Model Quantization Library

Load Pretrained Network

To load the pretrained series network, which was trained on the Modified National Institute of Standards and Technology (MNIST) database, enter:

snet = getDigitsNetwork;

Quantize Network

To quantize the MNIST-based digits network, enter:

dlquantObj = dlquantizer(snet,'ExecutionEnvironment','FPGA');
Image = imageDatastore('five_28x28.pgm','Labels','five');
dlquantObj.calibrate(Image)
ans=21×5 table
        Optimized Layer Name        Network Layer Name    Learnables / Activations    MinValue     MaxValue
    ____________________________    __________________    ________________________    _________    ________

    {'conv_1_Weights'          }     {'batchnorm_1'}           "Weights"              -0.017061    0.013648
    {'conv_1_Bias'             }     {'batchnorm_1'}           "Bias"                 -0.025344    0.058799
    {'conv_2_Weights'          }     {'batchnorm_2'}           "Weights"               -0.54744     0.51019
    {'conv_2_Bias'             }     {'batchnorm_2'}           "Bias"                   -1.1787      1.0515
    {'conv_3_Weights'          }     {'batchnorm_3'}           "Weights"               -0.39927     0.44173
    {'conv_3_Bias'             }     {'batchnorm_3'}           "Bias"                  -0.85118      1.1321
    {'fc_Weights'              }     {'fc'         }           "Weights"               -0.22558     0.29637
    {'fc_Bias'                 }     {'fc'         }           "Bias"                 -0.011837    0.016848
    {'imageinput'              }     {'imageinput' }           "Activations"                  0         255
    {'imageinput_normalization'}     {'imageinput' }           "Activations"            -22.566      232.43
    {'conv_1'                  }     {'batchnorm_1'}           "Activations"            -7.9196      6.7861
    {'relu_1'                  }     {'relu_1'     }           "Activations"                  0      6.7861
    {'maxpool_1'               }     {'maxpool_1'  }           "Activations"                  0      6.7861
    {'conv_2'                  }     {'batchnorm_2'}           "Activations"            -8.4641      7.2347
    {'relu_2'                  }     {'relu_2'     }           "Activations"                  0      7.2347
    {'maxpool_2'               }     {'maxpool_2'  }           "Activations"                  0      7.2347
      ⋮

Retrieve zcu102_int8 Bitstream Configuration

To retrieve the zcu102_int8 bitstream configuration, use the dlhdl.ProcessorConfig object. For more information, see dlhdl.ProcessorConfig. To learn about modifiable parameters of the processor configuration, see getModuleProperty and setModuleProperty.

hPC_reference = dlhdl.ProcessorConfig('Bitstream','zcu102_int8')
hPC_reference = 
                    Processing Module "conv"
                            ConvThreadNumber: 64
                             InputMemorySize: [227  227    3]
                            OutputMemorySize: [227  227    3]
                            FeatureSizeLimit: 2048
                              KernelDataType: 'int8'

                      Processing Module "fc"
                              FCThreadNumber: 16
                             InputMemorySize: 25088
                            OutputMemorySize: 4096
                              KernelDataType: 'int8'

                   Processing Module "adder"
                             InputMemorySize: 40
                            OutputMemorySize: 40
                              KernelDataType: 'int8'

                     System Level Properties
                              TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
                             TargetFrequency: 250
                               SynthesisTool: 'Xilinx Vivado'
                             ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
                     SynthesisToolChipFamily: 'Zynq UltraScale+'
                     SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
                    SynthesisToolPackageName: ''
                     SynthesisToolSpeedValue: ''

Estimate Network Performance and Resource Utilization for zcu102_int8 Bitstream Configuration

To estimate the performance of the digits series network, use the estimatePerformance function of the dlhdl.ProcessorConfig object. The function returns the estimated layer latency, network latency, and network performance in frames per second (Frames/s).

To estimate the resource use of the zcu102_int8 bitstream, use the estimateResources function of the dlhdl.ProcessorConfig object. The function returns the estimated DSP slice and BRAM usage.

hPC_reference.estimatePerformance(dlquantObj)
### Optimizing series network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
3 Memory Regions created.



              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                      57955                  0.00023                       1              57955           4313.7
    ____conv_1                4391                  0.00002 
    ____maxpool_1             2877                  0.00001 
    ____conv_2                2351                  0.00001 
    ____maxpool_2             2265                  0.00001 
    ____conv_3                2507                  0.00001 
    ____fc                   43564                  0.00017 
 * The clock frequency of the DL processor is: 250MHz
hPC_reference.estimateResources
              Deep Learning Processor Estimator Resource Results

                             DSPs          Block RAM*     
                        -------------    -------------    
DL_Processor                     768              386 		 
    conv_module                  647              315 		 
    fc_module                     97               50 		 
    adder_module                  24               12 		 
    debug_module                   0                8 		 
    sched_module                   0                1 		 
* Block RAM represents Block RAM tiles in Xilinx devices and Block RAM bits in Intel devices
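The Frames/s value in the performance table appears to follow directly from the clock frequency and the frame latency in cycles. The sketch below recomputes it from values copied out of the output above. The relation frames/s = clock frequency / cycles per frame is inferred from the reported numbers for a single frame. It is not quoted from the toolbox documentation, so treat it as an assumption.

```matlab
% Values copied by hand from the estimatePerformance output above.
clockHz       = 250e6;     % reported DL processor clock frequency
networkCycles = 57955;     % LastFrameLatency(cycles), Network row
layerCycles   = [4391 2877 2351 2265 2507 43564];   % conv_1 through fc

latencySec = networkCycles / clockHz;   % seconds per frame
fps        = clockHz / networkCycles;   % frames per second (assumed relation)

% The per-layer cycle counts should add up to the network total.
layersSumToNetwork = (sum(layerCycles) == networkCycles);

fprintf('Latency: %.5f s, throughput: %.1f frames/s\n', latencySec, fps);
fprintf('Layer cycles sum to the network total: %d\n', layersSumToNetwork);
```

The recomputed throughput agrees with the reported 4313.7 frames/s to the printed precision.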

The estimated performance is 4314 FPS and the estimated resource use counts are:

  • Digital signal processor (DSP) slice count - 768

  • Block random access memory (BRAM) count - 386

The estimated DSP slice and BRAM counts exceed the target device resource budget. Customize the bitstream configuration to reduce resource use.
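To make the size of the gap concrete, the following minimal sketch compares the zcu102_int8 estimate with the target device budget. The struct names `budget` and `estimated` are illustrative, and all numbers are copied from the text above.

```matlab
budget    = struct('DSP', 240, 'BRAM', 128);   % target device resource counts
estimated = struct('DSP', 768, 'BRAM', 386);   % zcu102_int8 estimate

resources  = fieldnames(budget);
overBudget = false(numel(resources), 1);
ratio      = zeros(numel(resources), 1);
for k = 1:numel(resources)
    r = resources{k};
    ratio(k)      = estimated.(r) / budget.(r);   % how many times the budget
    overBudget(k) = estimated.(r) > budget.(r);
    fprintf('%-4s estimated %4d, budget %4d, ratio %.2fx\n', ...
        r, estimated.(r), budget.(r), ratio(k));
end
```

Both resources come out at roughly three times the budget, so a small adjustment to a single parameter is unlikely to be enough.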

Create Custom Bitstream Configuration

To create a custom processor configuration, use the dlhdl.ProcessorConfig object. For more information, see dlhdl.ProcessorConfig. To learn about modifiable parameters of the processor configuration, see getModuleProperty and setModuleProperty.

To reduce the resource use of the custom bitstream, set the KernelDataType of the conv, fc, and adder modules to int8. Reduce the ConvThreadNumber to lower the DSP slice count. Reduce the InputMemorySize and OutputMemorySize of the conv module to lower the BRAM count.

hPC_custom = dlhdl.ProcessorConfig;
hPC_custom.setModuleProperty('conv','KernelDataType','int8');
hPC_custom.setModuleProperty('fc','KernelDataType','int8');
hPC_custom.setModuleProperty('adder','KernelDataType','int8');
hPC_custom.setModuleProperty('conv','ConvThreadNumber',4);
hPC_custom.setModuleProperty('conv','InputMemorySize',[30 30 1]);
hPC_custom.setModuleProperty('conv','OutputMemorySize',[30 30 1]);
hPC_custom
hPC_custom = 
                    Processing Module "conv"
                            ConvThreadNumber: 4
                             InputMemorySize: [30  30   1]
                            OutputMemorySize: [30  30   1]
                            FeatureSizeLimit: 2048
                              KernelDataType: 'int8'

                      Processing Module "fc"
                              FCThreadNumber: 4
                             InputMemorySize: 25088
                            OutputMemorySize: 4096
                              KernelDataType: 'int8'

                   Processing Module "adder"
                             InputMemorySize: 40
                            OutputMemorySize: 40
                              KernelDataType: 'int8'

                     System Level Properties
                              TargetPlatform: 'Xilinx Zynq UltraScale+ MPSoC ZCU102 Evaluation Kit'
                             TargetFrequency: 200
                               SynthesisTool: 'Xilinx Vivado'
                             ReferenceDesign: 'AXI-Stream DDR Memory Access : 3-AXIM'
                     SynthesisToolChipFamily: 'Zynq UltraScale+'
                     SynthesisToolDeviceName: 'xczu9eg-ffvb1156-2-e'
                    SynthesisToolPackageName: ''
                     SynthesisToolSpeedValue: ''
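To get a feel for how far the conv memory settings shrink, the sketch below compares the element counts implied by the default and custom InputMemorySize values. It is a rough illustration only. It assumes that the three-element size is [height width channels] and that the network input is the 28-by-28 grayscale image suggested by the calibration file name five_28x28.pgm. It does not model how the toolbox maps these sizes to BRAM, so the ratio should not be read as a BRAM prediction.

```matlab
defaultSize = [227 227 3];   % default conv InputMemorySize
customSize  = [30 30 1];     % custom conv InputMemorySize
imageSize   = [28 28 1];     % ASSUMED network input size (from the file name)

defaultElems = prod(defaultSize);   % elements implied by the default size
customElems  = prod(customSize);    % elements implied by the custom size
reduction    = defaultElems / customElems;

% The custom size should still be at least as large as the input image
% in every dimension.
fitsImage = all(customSize >= imageSize);

fprintf('Default: %d elements, custom: %d elements (%.0fx fewer)\n', ...
    defaultElems, customElems, reduction);
fprintf('Custom size covers the assumed input image: %d\n', fitsImage);
```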

Estimate Network Performance and Resource Utilization for Custom Bitstream Configuration

To estimate the performance of the digits series network, use the estimatePerformance function of the dlhdl.ProcessorConfig object. The function returns the estimated layer latency, network latency, and network performance in frames per second (Frames/s).

To estimate the resource use of the hPC_custom bitstream, use the estimateResources function of the dlhdl.ProcessorConfig object. The function returns the estimated DSP slice and BRAM usage.

hPC_custom.estimatePerformance(dlquantObj)
### Optimizing series network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
3 Memory Regions created.



              Deep Learning Processor Estimator Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)       FramesNum      Total Latency     Frames/s
                         -------------             -------------              ---------        ---------       ---------
Network                     348511                  0.00174                       1             348511            573.9
    ____conv_1               27250                  0.00014 
    ____maxpool_1            42337                  0.00021 
    ____conv_2               45869                  0.00023 
    ____maxpool_2            68153                  0.00034 
    ____conv_3              121493                  0.00061 
    ____fc                   43409                  0.00022 
 * The clock frequency of the DL processor is: 200MHz
hPC_custom.estimateResources
              Deep Learning Processor Estimator Resource Results

                             DSPs          Block RAM*     
                        -------------    -------------    
DL_Processor                     120              108 		 
    conv_module                   89               63 		 
    fc_module                     25               33 		 
    adder_module                   6                3 		 
    debug_module                   0                8 		 
    sched_module                   0                1 		 
* Block RAM represents Block RAM tiles in Xilinx devices and Block RAM bits in Intel devices
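The two performance tables also show where the time goes. The sketch below, using layer cycle counts copied by hand from the zcu102_int8 and hPC_custom estimates, computes each layer's share of the frame latency and reports the largest contributor in each configuration. Variable names are illustrative.

```matlab
layerNames   = {'conv_1','maxpool_1','conv_2','maxpool_2','conv_3','fc'};
refCycles    = [ 4391  2877  2351  2265   2507 43564];   % zcu102_int8 estimate
customCycles = [27250 42337 45869 68153 121493 43409];   % hPC_custom estimate

% Share of the total frame latency per layer, in percent.
refShare    = 100 * refCycles    / sum(refCycles);
customShare = 100 * customCycles / sum(customCycles);

[~, refIdx]    = max(refCycles);
[~, customIdx] = max(customCycles);

for k = 1:numel(layerNames)
    fprintf('%-10s reference %5.1f%%  custom %5.1f%%\n', ...
        layerNames{k}, refShare(k), customShare(k));
end
fprintf('Largest contributor: %s (reference), %s (custom)\n', ...
    layerNames{refIdx}, layerNames{customIdx});
```

In these estimates the fc layer takes nearly the same number of cycles in both configurations, while the conv and maxpool layers take many more cycles in the custom one. That pattern is consistent with the lower ConvThreadNumber in hPC_custom, although the estimates alone do not prove the cause.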

The estimated performance is 574 FPS and the estimated resource use counts are:

  • Digital signal processor (DSP) slice count - 120

  • Block random access memory (BRAM) count - 108

The estimated resource use of the customized bitstream fits within the target device resource budget of 240 DSP slices and 128 BRAMs, and the estimated performance of 574 FPS exceeds the 500 FPS target.
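The final comparison can be written as a short check. This sketch uses only numbers from the text and the hPC_custom estimates above. The frames-per-second relation (clock frequency divided by cycles per frame) is inferred from the reported values and is an assumption, and the variable names are illustrative.

```matlab
targetFps = 500;                               % performance requirement
budget    = struct('DSP', 240, 'BRAM', 128);   % target device resource counts
estimated = struct('DSP', 120, 'BRAM', 108);   % hPC_custom estimate

clockHz       = 200e6;     % reported DL processor clock frequency
networkCycles = 348511;    % LastFrameLatency(cycles), Network row

fps      = clockHz / networkCycles;       % assumed relation, single frame
meetsFps = fps >= targetFps;

dspUtil  = estimated.DSP  / budget.DSP;   % fraction of the DSP budget used
bramUtil = estimated.BRAM / budget.BRAM;  % fraction of the BRAM budget used
withinBudget = (dspUtil <= 1) && (bramUtil <= 1);

fprintf('Throughput %.1f frames/s (target %d): %d\n', fps, targetFps, meetsFps);
fprintf('DSP %.0f%% of budget, BRAM %.0f%% of budget, within budget: %d\n', ...
    100*dspUtil, 100*bramUtil, withinBudget);
```

The BRAM estimate leaves less headroom than the DSP estimate, so BRAM is the tighter constraint if the configuration needs to change later.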
