Speech Command Recognition by Using FPGA
This example shows how to deploy a custom pretrained series network that detects the presence of speech commands in audio to a Xilinx™ Zynq® UltraScale+™ MPSoC ZCU102 Evaluation Kit. This example uses the pretrained network that was trained by using the Speech Commands Dataset [1]. To create the pretrained network, see Train Speech Command Recognition Model Using Deep Learning.
Prerequisites
Deep Learning Toolbox™
Deep Learning HDL Toolbox™
Deep Learning HDL Toolbox™ Support Package for Xilinx FPGA and SoC
Audio Toolbox™
Xilinx™ Zynq® UltraScale+™ MPSoC ZCU102 Evaluation Kit
Load Speech Commands Data Set
This example uses the Google Speech Commands Dataset [1]. Download the data set and unzip the downloaded file. Set dataFolder to the location of the data.
url = 'https://ssd.mathworks.com/supportfiles/audio/google_speech.zip';
downloadFolder = tempdir;
dataFolder = fullfile(downloadFolder,'google_speech');
if ~exist(dataFolder,'dir')
    disp('Downloading data set (1.4 GB) ...')
    unzip(url,downloadFolder)
end
Downloading data set (1.4 GB) ...
Load Pretrained Speech Recognition Network
The pretrained network trainedAudioNet is a simple series network made up of 24 layers. The network uses max pooling layers to downsample the feature maps "spatially" (that is, in time and frequency) and a final max pooling layer that pools the input feature map globally over time. This enforces (approximate) time-translation invariance in the input spectrograms, allowing the network to perform the same classification independent of the exact position of the speech in time. Global pooling also significantly reduces the number of parameters in the final fully connected layer. To reduce the possibility of the network memorizing specific features of the training data, the network adds a small amount of dropout to the input of the last fully connected layer.
The network is small, as it has only five convolutional layers with few filters. numF controls the number of filters in the convolutional layers. To increase the accuracy of the network, try increasing the network depth by adding identical blocks of convolutional, batch normalization, and ReLU layers. You can also try increasing the number of convolutional filters by increasing numF.
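For reference, the sketch below shows a comparable layer array. It is an illustrative sketch, not the exact definition of trainedAudioNet: the value of numF, the dropout probability, and the layer grouping are assumptions, although the filter counts and the 13×1 time pooling are consistent with the layer list printed by the compile function later in this example.

numF = 12;                      % assumed base number of filters
numClasses = 12;                % 10 commands + unknown + background
inputSize = [98 50 1];          % time-by-band auditory spectrogram

layersSketch = [
    imageInputLayer(inputSize)

    convolution2dLayer(3,numF,'Padding','same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3,'Stride',2,'Padding','same')

    convolution2dLayer(3,2*numF,'Padding','same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3,'Stride',2,'Padding','same')

    convolution2dLayer(3,4*numF,'Padding','same')
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(3,'Stride',2,'Padding','same')

    convolution2dLayer(3,4*numF,'Padding','same')
    batchNormalizationLayer
    reluLayer
    convolution2dLayer(3,4*numF,'Padding','same')
    batchNormalizationLayer
    reluLayer

    maxPooling2dLayer([13 1])   % pool globally over time
    dropoutLayer(0.2)           % assumed dropout probability
    fullyConnectedLayer(numClasses)
    softmaxLayer];              % the pretrained network ends with a custom weighted classification layer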
The network uses a weighted cross-entropy classification loss. The weightedClassificationLayer function creates a custom classification layer that calculates the cross-entropy loss with observations weighted by classWeights. Specify the class weights in the same order as the classes appear in categories(YTrain). To give each class equal total weight in the loss, use class weights that are inversely proportional to the number of training examples in each class. When you use the Adam optimizer to train the network, the training algorithm is independent of the overall normalization of the class weights. Load the pretrained network trainedAudioNet.
load('trainedAudioNet.mat');
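For reference, inverse-frequency class weights of this kind can be computed from the training labels as in the sketch below. This is only a sketch, not code from the training of the pretrained model, and it assumes YTrain, the categorical label vector created later in this example.

classCounts = countcats(YTrain);                 % examples per class, in categories(YTrain) order
classWeights = 1./classCounts;                   % inversely proportional to class size
classWeights = classWeights/mean(classWeights);  % overall scale does not matter with Adam
% The custom weightedClassificationLayer applies weights like these to the
% cross-entropy loss during training.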
Create Training and Validation Datastore
Create an audioDatastore (Audio Toolbox) object that points to the training and validation data sets.
ads = audioDatastore(fullfile(dataFolder,'train'), ...
    'IncludeSubfolders',true, ...
    'FileExtensions','.wav', ...
    'LabelSource','foldernames');
Specify the words that you want your model to recognize as commands. Label all words that are not commands as unknown. This labeling creates a group of words that approximates the distribution of all words other than the commands. The network uses this group to learn the difference between commands and all other words.
To reduce the class imbalance between the known and unknown words and speed up processing, include only a fraction of the unknown words in the training set.
To create a datastore that contains only the commands and the subset of unknown words, use subset (Audio Toolbox). Count the number of examples belonging to each category.
commands = categorical(["yes","no","up","down","left","right","on","off","stop","go"]);

isCommand = ismember(ads.Labels,commands);
isUnknown = ~isCommand;

includeFraction = 0.2;
mask = rand(numel(ads.Labels),1) < includeFraction;
isUnknown = isUnknown & mask;
ads.Labels(isUnknown) = categorical("unknown");

adsTrain = subset(ads,isCommand|isUnknown);
countEachLabel(adsTrain)
ans=11×2 table
Label Count
_______ _____
down 1842
go 1861
left 1839
no 1853
off 1839
on 1864
right 1852
stop 1885
unknown 6520
up 1843
yes 1860
ads = audioDatastore(fullfile(dataFolder,'validation'), ...
    'IncludeSubfolders',true, ...
    'FileExtensions','.wav', ...
    'LabelSource','foldernames');

isCommand = ismember(ads.Labels,commands);
isUnknown = ~isCommand;

includeFraction = 0.2;
mask = rand(numel(ads.Labels),1) < includeFraction;
isUnknown = isUnknown & mask;
ads.Labels(isUnknown) = categorical("unknown");

adsValidation = subset(ads,isCommand|isUnknown);
countEachLabel(adsValidation)
ans=11×2 table
Label Count
_______ _____
down 264
go 260
left 247
no 270
off 256
on 257
right 256
stop 246
unknown 883
up 260
yes 261
To train the network with the entire data set and achieve the highest possible accuracy, set reduceDataset to false. To run this example quickly, set reduceDataset to true.
reduceDataset = false;
if reduceDataset
    numUniqueLabels = numel(unique(adsTrain.Labels));
    % Reduce the data set by a factor of 20
    adsTrain = splitEachLabel(adsTrain,round(numel(adsTrain.Files) / numUniqueLabels / 20));
    adsValidation = splitEachLabel(adsValidation,round(numel(adsValidation.Files) / numUniqueLabels / 20));
end
Compute Auditory Spectrograms
To prepare the data for efficient training of a convolutional neural network, convert the speech waveforms to auditory-based spectrograms.
Define the parameters of the feature extraction. The segmentDuration variable is the duration of each speech clip (in seconds). The frameDuration variable is the duration of each frame for spectrum calculation. The hopDuration variable is the time step between each spectrum. numBands is the number of filters in the auditory spectrogram.
To perform the feature extraction, create an audioFeatureExtractor (Audio Toolbox) object.
fs = 16e3; % Known sample rate of the data set.

segmentDuration = 1;
frameDuration = 0.025;
hopDuration = 0.010;

segmentSamples = round(segmentDuration*fs);
frameSamples = round(frameDuration*fs);
hopSamples = round(hopDuration*fs);
overlapSamples = frameSamples - hopSamples;

FFTLength = 512;
numBands = 50;

afe = audioFeatureExtractor( ...
    'SampleRate',fs, ...
    'FFTLength',FFTLength, ...
    'Window',hann(frameSamples,'periodic'), ...
    'OverlapLength',overlapSamples, ...
    'barkSpectrum',true);
setExtractorParameters(afe,'barkSpectrum','NumBands',numBands,'WindowNormalization',false);
Read a file from the data set. Training a convolutional neural network requires the input to be a consistent size. Some files in the data set are less than 1 second long. Apply zero-padding to the front and back of the audio signal so that it is of length segmentSamples.
x = read(adsTrain);

numSamples = size(x,1);
numToPadFront = floor((segmentSamples - numSamples)/2);
numToPadBack = ceil((segmentSamples - numSamples)/2);

xPadded = [zeros(numToPadFront,1,'like',x);x;zeros(numToPadBack,1,'like',x)];
To extract audio features, call extract. The output is a Bark spectrum with time across rows.
features = extract(afe,xPadded);
[numHops,numFeatures] = size(features)
numHops = 98
numFeatures = 50
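The value numHops = 98 follows from the framing parameters. As a quick sanity check, assuming the standard framing relationship (hop length equals window length minus overlap length), the expected number of analysis frames is:

% Sanity check (assumption: standard framing of a segmentSamples-long signal)
expectedHops = floor((segmentSamples - frameSamples)/hopSamples) + 1
% floor((16000 - 400)/160) + 1 = 98, which matches numHops above.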
In this example, you post-process the auditory spectrogram by applying a logarithm. Taking a log of small numbers can lead to roundoff error.
To speed up processing, you can distribute the feature extraction across multiple workers by using parfor.
First, determine the number of partitions for the dataset. If you do not have Parallel Computing Toolbox™, use a single partition.
if ~isempty(ver('parallel')) && ~reduceDataset
    pool = gcp;
    numPar = numpartitions(adsTrain,pool);
else
    numPar = 1;
end
Starting parallel pool (parpool) using the 'Processes' profile ...
03-Jan-2024 14:16:35: Job Queued. Waiting for parallel pool job with ID 1 to start ...
Connected to parallel pool with 6 workers.
For each partition, read from the datastore, zero-pad the signal, and then extract the features.
parfor ii = 1:numPar
    subds = partition(adsTrain,numPar,ii);
    XTrain = zeros(numHops,numBands,1,numel(subds.Files));
    for idx = 1:numel(subds.Files)
        x = read(subds);
        xPadded = [zeros(floor((segmentSamples-size(x,1))/2),1);x;zeros(ceil((segmentSamples-size(x,1))/2),1)];
        XTrain(:,:,:,idx) = extract(afe,xPadded);
    end
    XTrainC{ii} = XTrain;
end
Convert the output to a four-dimensional array that has auditory spectrograms along the fourth dimension.
XTrain = cat(4,XTrainC{:});
[numHops,numBands,numChannels,numSpec] = size(XTrain)
numHops = 98
numBands = 50
numChannels = 1
numSpec = 25058
To obtain data that has a smoother distribution, take the logarithm of the spectrograms by using a small offset.
epsil = 1e-6;
XTrain = log10(XTrain + epsil);
Perform the feature extraction steps described above for the validation set.
if ~isempty(ver('parallel'))
    pool = gcp;
    numPar = numpartitions(adsValidation,pool);
else
    numPar = 1;
end

parfor ii = 1:numPar
    subds = partition(adsValidation,numPar,ii);
    XValidation = zeros(numHops,numBands,1,numel(subds.Files));
    for idx = 1:numel(subds.Files)
        x = read(subds);
        xPadded = [zeros(floor((segmentSamples-size(x,1))/2),1);x;zeros(ceil((segmentSamples-size(x,1))/2),1)];
        XValidation(:,:,:,idx) = extract(afe,xPadded);
    end
    XValidationC{ii} = XValidation;
end

XValidation = cat(4,XValidationC{:});
XValidation = log10(XValidation + epsil);
Isolate the training and validation labels. Remove empty categories.
YTrain = removecats(adsTrain.Labels);
YValidation = removecats(adsValidation.Labels);
Add Background Noise Data
The network must be able to recognize different spoken words and also to detect if the input contains silence or background noise.
To create samples of one-second clips of background noise, use the audio files in the background folder. Create an equal number of background clips from each background noise file. You can also create your own recordings of background noise and add them to the background folder. Before calculating the spectrograms, the code rescales each audio clip by a factor sampled from a log-uniform distribution in the range given by volumeRange.
adsBkg = audioDatastore(fullfile(dataFolder,'background'));
numBkgClips = 4000;
if reduceDataset
    numBkgClips = numBkgClips/20;
end
volumeRange = log10([1e-4,1]);

numBkgFiles = numel(adsBkg.Files);
numClipsPerFile = histcounts(1:numBkgClips,linspace(1,numBkgClips,numBkgFiles+1));
Xbkg = zeros(size(XTrain,1),size(XTrain,2),1,numBkgClips,'single');
bkgAll = readall(adsBkg);
ind = 1;

for count = 1:numBkgFiles
    bkg = bkgAll{count};
    idxStart = randi(numel(bkg)-fs,numClipsPerFile(count),1);
    idxEnd = idxStart+fs-1;
    gain = 10.^((volumeRange(2)-volumeRange(1))*rand(numClipsPerFile(count),1) + volumeRange(1));
    for j = 1:numClipsPerFile(count)
        x = bkg(idxStart(j):idxEnd(j))*gain(j);
        x = max(min(x,1),-1);
        Xbkg(:,:,:,ind) = extract(afe,x);
        if mod(ind,1000)==0
            disp("Processed " + string(ind) + " background clips out of " + string(numBkgClips))
        end
        ind = ind + 1;
    end
end
Processed 1000 background clips out of 4000
Processed 2000 background clips out of 4000
Processed 3000 background clips out of 4000
Processed 4000 background clips out of 4000
Xbkg = log10(Xbkg + epsil);
Split the spectrograms of background noise among the training, validation, and test sets. Because the background folder contains only about five and a half minutes of background noise, the background samples in the different data sets are highly correlated. To increase the variation in the background noise, you can create your own background files and add them to the folder. To increase the robustness of the network to noise, you can also try mixing background noise into the speech files, as sketched after the split code below.
numTrainBkg = floor(0.85*numBkgClips);
numValidationBkg = floor(0.15*numBkgClips);

XTrain(:,:,:,end+1:end+numTrainBkg) = Xbkg(:,:,:,1:numTrainBkg);
YTrain(end+1:end+numTrainBkg) = "background";

XValidation(:,:,:,end+1:end+numValidationBkg) = Xbkg(:,:,:,numTrainBkg+1:end);
YValidation(end+1:end+numValidationBkg) = "background";
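The following is a minimal sketch of one way to mix a background clip into a speech clip at a chosen signal-to-noise ratio. This step is not part of the deployed example; targetSNRdB and the reuse of bkg and xPadded from the code above are illustrative assumptions.

% Sketch only: mix one second of background noise into a padded speech clip
% at an assumed target SNR. Not used elsewhere in this example.
targetSNRdB = 10;                              % assumed SNR in dB
speech = xPadded;                              % padded speech clip from earlier
noise = bkg(1:segmentSamples);                 % one second of the last background file
noise = noise*norm(speech)/(norm(noise)*10^(targetSNRdB/20));
noisySpeech = max(min(speech + noise,1),-1);   % clip to [-1, 1]
noisyFeatures = log10(extract(afe,noisySpeech) + epsil);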
Create Target Object
Create a target object that specifies the vendor of your target device and the interface that connects the target device to the host computer. Interface options are JTAG (default) and Ethernet. Vendor options are Intel and Xilinx. This example uses the Ethernet interface and the installed Xilinx Vivado Design Suite to program the device.
hT = dlhdl.Target('Xilinx', Interface = 'Ethernet');
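If you use the default JTAG interface instead, or if you want to specify the board address explicitly for Ethernet, the object creation is similar. The commented-out alternatives below are an illustrative sketch; the IP address is an assumption based on the board address shown in the deploy output later in this example.

% Alternative target configurations (illustrative sketch):
% hT = dlhdl.Target('Xilinx');                                                       % JTAG, the default interface
% hT = dlhdl.Target('Xilinx', Interface = 'Ethernet', IPAddress = '192.168.1.101');  % Ethernet with explicit IP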
Create Workflow Object
Create an object of the dlhdl.Workflow class. When you create the object, specify the network and the bitstream name. Specify trainedNet, the pretrained series network loaded from trainedAudioNet.mat, as the network. Make sure that the bitstream name matches the data type and the FPGA board that you are targeting. In this example, the target FPGA board is the Zynq UltraScale+ MPSoC ZCU102 board. The bitstream uses a single data type.
hW = dlhdl.Workflow(Network = trainedNet, Bitstream = 'zcu102_single', Target = hT);
Compile trainedAudioNet Network
To compile the trainedAudioNet series network, run the compile function of the dlhdl.Workflow object.
compile(hW)
### Compiling network for Deep Learning FPGA prototyping ...
### Targeting FPGA bitstream zcu102_single.
### Optimizing network: Fused 'nnet.cnn.layer.BatchNormalizationLayer' into 'nnet.cnn.layer.Convolution2DLayer'
### Notice: The layer 'imageinput' of type 'ImageInputLayer' is split into an image input layer 'imageinput' and an addition layer 'imageinput_norm' for normalization on hardware.
### The network includes the following layers:
     1   'imageinput'    Image Input             98×50×1 images with 'zerocenter' normalization                (SW Layer)
     2   'conv_1'        2-D Convolution         12 3×3×1 convolutions with stride [1 1] and padding 'same'    (HW Layer)
     3   'relu_1'        ReLU                    ReLU                                                          (HW Layer)
     4   'maxpool_1'     2-D Max Pooling         3×3 max pooling with stride [2 2] and padding 'same'          (HW Layer)
     5   'conv_2'        2-D Convolution         24 3×3×12 convolutions with stride [1 1] and padding 'same'   (HW Layer)
     6   'relu_2'        ReLU                    ReLU                                                          (HW Layer)
     7   'maxpool_2'     2-D Max Pooling         3×3 max pooling with stride [2 2] and padding 'same'          (HW Layer)
     8   'conv_3'        2-D Convolution         48 3×3×24 convolutions with stride [1 1] and padding 'same'   (HW Layer)
     9   'relu_3'        ReLU                    ReLU                                                          (HW Layer)
    10   'maxpool_3'     2-D Max Pooling         3×3 max pooling with stride [2 2] and padding 'same'          (HW Layer)
    11   'conv_4'        2-D Convolution         48 3×3×48 convolutions with stride [1 1] and padding 'same'   (HW Layer)
    12   'relu_4'        ReLU                    ReLU                                                          (HW Layer)
    13   'conv_5'        2-D Convolution         48 3×3×48 convolutions with stride [1 1] and padding 'same'   (HW Layer)
    14   'relu_5'        ReLU                    ReLU                                                          (HW Layer)
    15   'maxpool_4'     2-D Max Pooling         13×1 max pooling with stride [1 1] and padding [0 0 0 0]      (HW Layer)
    16   'fc'            Fully Connected         12 fully connected layer                                      (HW Layer)
    17   'softmax'       Softmax                 softmax                                                       (SW Layer)
    18   'classoutput'   Classification Output   Weighted cross entropy                                        (SW Layer)
### Notice: The layer 'softmax' with type 'nnet.cnn.layer.SoftmaxLayer' is implemented in software.
### Notice: The layer 'classoutput' with type 'weightedClassificationLayer' is implemented in software.
### Compiling layer group: conv_1>>relu_5 ...
### Compiling layer group: conv_1>>relu_5 ... complete.
### Compiling layer group: maxpool_4 ...
### Compiling layer group: maxpool_4 ... complete.
### Compiling layer group: fc ...
### Compiling layer group: fc ... complete.

### Allocating external memory buffers:

          offset_name          offset_address     allocated_space
    _______________________    ______________    __________________

    "InputDataOffset"           "0x00000000"     "2.2 MB"
    "OutputResultOffset"        "0x0023f000"     "4.0 kB"
    "SchedulerDataOffset"       "0x00240000"     "336.0 kB"
    "SystemBufferOffset"        "0x00294000"     "484.0 kB"
    "InstructionDataOffset"     "0x0030d000"     "196.0 kB"
    "ConvWeightDataOffset"      "0x0033e000"     "224.0 kB"
    "FCWeightDataOffset"        "0x00376000"     "16.0 kB"
    "EndOffset"                 "0x0037a000"     "Total: 3560.0 kB"

### Network compilation complete.
ans = struct with fields:
weights: [1×1 struct]
instructions: [1×1 struct]
registers: [1×1 struct]
syncInstructions: [1×1 struct]
constantData: {{} [1×19600 single]}
ddrInfo: [1×1 struct]
resourceTable: [6×2 table]
Program Bitstream onto FPGA and Download Network Weights
To deploy the network on the Zynq® UltraScale+™ MPSoC ZCU102 hardware, run the deploy function of the dlhdl.Workflow object. This function uses the output of the compile function to program the FPGA board by using the programming file. It also downloads the network weights and biases. The deploy function verifies the Xilinx Vivado tool and the supported tool version. It then starts programming the FPGA device by using the bitstream, displays progress messages, and displays the time it takes to deploy the network.
deploy(hW)
### Programming FPGA Bitstream using Ethernet...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming FPGA device on Xilinx SoC hardware board at 192.168.1.101...
### Copying FPGA programming files to SD card...
### Setting FPGA bitstream and devicetree for boot...
# Copying Bitstream zcu102_single.bit to /mnt/hdlcoder_rd
# Set Bitstream to hdlcoder_rd/zcu102_single.bit
# Copying Devicetree devicetree_dlhdl.dtb to /mnt/hdlcoder_rd
# Set Devicetree to hdlcoder_rd/devicetree_dlhdl.dtb
# Set up boot for Reference Design: 'AXI-Stream DDR Memory Access : 3-AXIM'
### Rebooting Xilinx SoC at 192.168.1.101...
### Reboot may take several seconds...
### Attempting to connect to the hardware board at 192.168.1.101...
### Connection successful
### Programming the FPGA bitstream has been completed successfully.
### Loading weights to Conv Processor.
### Conv Weights loaded. Current time is 03-Jan-2024 14:19:08
### Loading weights to FC Processor.
### FC Weights loaded. Current time is 03-Jan-2024 14:19:08
Run Prediction on Audio Files
Classify five inputs from the validation data set and compare the prediction results to the classification results from Deep Learning Toolbox™. YPred is the classification result from Deep Learning Toolbox™. The fpga_prediction variable is the classification result from the FPGA.
numtestFrames = size(XValidation,4);
numView = 5;
listIndex = randperm(numtestFrames,numView);
testDataBatch = XValidation(:,:,:,listIndex);
YPred = classify(trainedNet,testDataBatch);
[scores,speed] = predict(hW,testDataBatch, Profile = 'on');
### Finished writing input activations.
### Running in multi-frame mode with 5 inputs.


              Deep Learning Processor Profiler Performance Results

                   LastFrameLatency(cycles)   LastFrameLatency(seconds)   FramesNum   Total Latency   Frames/s
                         -------------              -------------         ---------     ---------     ---------
Network                      318982                    0.00145                5          1544587        712.2
    imageinput_norm           43337                    0.00020
    conv_1                    47867                    0.00022
    maxpool_1                 37802                    0.00017
    conv_2                    45553                    0.00021
    maxpool_2                 21443                    0.00010
    conv_3                    39147                    0.00018
    maxpool_3                 16390                    0.00007
    conv_4                    28150                    0.00013
    conv_5                    28730                    0.00013
    maxpool_4                  8027                    0.00004
    fc                         2496                    0.00001
 * The clock frequency of the DL processor is: 220MHz
[~,idx] = max(scores,[],2);
fpga_prediction = trainedNet.Layers(end).Classes(idx);
Compare the prediction results from Deep Learning Toolbox™ and the FPGA side by side. The prediction results from the FPGA match the prediction results from Deep Learning Toolbox™. In this table, the ground truth prediction is the Deep Learning Toolbox™ prediction.
fprintf('%12s %24s\n','Ground Truth','FPGA Prediction');
for i = 1:size(fpga_prediction,1)
    fprintf('%s %24s\n',YPred(i),fpga_prediction(i));
end
Ground Truth FPGA Prediction
background               background
yes                      yes
background               background
background               background
no                       no
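As an optional check, you can also confirm programmatically that the FPGA predictions agree with the Deep Learning Toolbox™ predictions for the sampled frames. This is a sketch and not part of the original example.

% Fraction of sampled frames where the FPGA prediction matches the
% Deep Learning Toolbox prediction (1 means full agreement).
agreement = mean(fpga_prediction == YPred(:))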
References
[1] Warden P. "Speech Commands: A public dataset for single-word speech recognition", 2017. Available at http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz. Copyright Google 2017. The Speech Commands Dataset is licensed under the Creative Commons Attribution 4.0 license, available here: https://creativecommons.org/licenses/by/4.0/legalcode.
See Also
dlhdl.Target | dlhdl.Workflow | compile | deploy | predict | classify