Using Lookup Tables to Accelerate Deep Learning Inference
This video shows how to use the Lookup Table Optimizer to generate an efficient lookup table for the sigmoid function, a key activation function in deep learning networks. We then compare the relative speedup of the generated code on an Arduino Due® and an STMicroelectronics® discovery board using hardware-in-the-loop simulation.
Published: 19 Nov 2019
A lookup table is a key construct in embedded designs and is often used to speed up the run-time execution of certain functions in your algorithm. For instance, expensive trigonometric functions are often replaced with a more efficient LUT implementation.
Let’s try a simple experiment: applying the same principle to the sigmoid function to investigate how we can accelerate deep learning inference, particularly on the edge.
The sigmoid function is a key building block for neural networks and is one of the most commonly used nonlinear activation functions in deep learning networks.
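For reference, the sigmoid is defined as sigmoid(x) = 1 / (1 + e^-x). It maps any real input to the open interval (0, 1) and saturates quickly away from zero, which is exactly what makes a bounded lookup table approximation practical.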
Here we have a simple Simulink subsystem that models the sigmoid function. I am going to use the Lookup Table Optimizer app to generate an optimal LUT, specifying the input and output data types. Since this is a bounded function, I can specify the bounds on the output, and finally a tolerance of 1% on the output.
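The generated code itself isn’t shown in the video, but a hand-written sketch conveys the structure of such an approximation. Everything here is an illustrative assumption rather than the optimizer’s actual output: the table size of 256 entries, the input range of [-8, 8], and the function names are all made up for the example.

```c
#include <math.h>

#define LUT_SIZE   256
#define LUT_X_MIN  (-8.0f)   /* assumed input range; the sigmoid is   */
#define LUT_X_MAX  ( 8.0f)   /* nearly saturated outside [-8, 8]      */

static float sigmoid_lut[LUT_SIZE];

/* Fill the table once at startup (done offline in a real deployment). */
void sigmoid_lut_init(void)
{
    for (int i = 0; i < LUT_SIZE; i++) {
        float x = LUT_X_MIN + (LUT_X_MAX - LUT_X_MIN) * i / (LUT_SIZE - 1);
        sigmoid_lut[i] = 1.0f / (1.0f + expf(-x));
    }
}

/* Approximate sigmoid: saturate outside the table, linearly
 * interpolate between the two nearest entries inside it. */
float sigmoid_approx(float x)
{
    if (x <= LUT_X_MIN) return sigmoid_lut[0];
    if (x >= LUT_X_MAX) return sigmoid_lut[LUT_SIZE - 1];

    float pos  = (x - LUT_X_MIN) / (LUT_X_MAX - LUT_X_MIN) * (LUT_SIZE - 1);
    int   idx  = (int)pos;
    if (idx >= LUT_SIZE - 1) return sigmoid_lut[LUT_SIZE - 1];
    float frac = pos - (float)idx;
    return sigmoid_lut[idx] + frac * (sigmoid_lut[idx + 1] - sigmoid_lut[idx]);
}
```

The trade is the usual one: a few hundred bytes of table memory in exchange for replacing an exponential and a division with an index computation and one interpolation.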
Once the optimization problem is solved, we can look at the comparison plot to verify that the error of the LUT approximation is within our specified tolerance.
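To mirror that check outside the app, one could sweep the input range on the host and record the worst-case error against a reference expf() implementation. This harness is my own sketch (compiled together with the hypothetical sigmoid_lut_init and sigmoid_approx above), not part of the video’s workflow:

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    sigmoid_lut_init();

    float max_err = 0.0f;
    for (float x = -8.0f; x <= 8.0f; x += 0.001f) {
        float exact  = 1.0f / (1.0f + expf(-x));
        float approx = sigmoid_approx(x);
        float err    = fabsf(approx - exact);
        if (err > max_err) max_err = err;
    }
    /* With 256 linearly interpolated entries, the worst-case error
     * should land well under the 1% tolerance used in the video. */
    printf("max absolute error: %g\n", max_err);
    return 0;
}
```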
Now, as a next step, let’s generate C code from the sigmoid function and the generated LUT and deploy it to a Cortex-M platform like the Arduino Due board.
We use hardware-in-the-loop simulation to run the generated code with inputs from Simulink. There is some overhead to running the code in this mode, but it still gives us a good comparison of the relative execution speed.
As you can see from the execution profile, the LUT is 2.5x faster on the Arduino. I repeated the same test on a Cortex-M7 based STMicroelectronics discovery board. Here is a plot showing the relative speedup of the lookup table with different data types.
In fact, this can scale up: if you share the lookup table approximation between all neurons, you can further decrease the execution time by orders of magnitude. You can run the same experiment with other activation functions like the hyperbolic tangent.
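As an illustrative sketch of that sharing (again building on the hypothetical sigmoid_approx above), every neuron’s activation calls into the same table, so it is initialized once and stays warm in cache:

```c
/* Apply the shared sigmoid LUT to a whole layer's pre-activations.
 * One table serves every neuron; only the cheap lookup runs per value. */
void activate_layer(const float *z, float *a, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] = sigmoid_approx(z[i]);
    }
}
```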
To learn more about optimizing LUTs in your design, please refer to the additional links below the video.